[01:58:03] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Hi @jcrespo - the host is racked, and the ETA for completion by @Cmjohnson and @RobH is next Wednesday...
[02:00:32] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Confirmed with @Cmjohnson and @RobH today, that these es1026-1034 hosts will be ready for you by end of Octo...
[05:14:00] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 58.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[05:20:26] PROBLEM - MariaDB sustained replica lag on db1143 is CRITICAL: 9.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[05:22:02] RECOVERY - MariaDB sustained replica lag on db1143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[05:23:40] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[08:24:20] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:46:11] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:53:35] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) After setup of the new hardware, and deployment of the new backup scheduler, a full backup run took from 2020-0...
[08:59:36] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) CC @LSobanski ^ 8.1 terabytes were backed up from live databases, prepared, gathered metadata for each file, t...
[09:04:48] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[09:31:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] ` and were **ALL** successful.
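A quick aside on the "MariaDB sustained replica lag" alerts above: "58.2 ge 2" means the measured lag (58.2 s) is greater than or equal to the 2 s critical threshold, and the "(C)2 ge (W)1" in the recoveries are the critical/warning thresholds. For reference only, a minimal way to eyeball lag by hand on a replica, assuming plain MariaDB replication; this is not the production check, which is documented at the wikitech link in the alerts:

    -- On the replica, from the mysql client; just a sanity peek, not the alert logic.
    SHOW SLAVE STATUS\G
    -- Relevant fields: Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master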
[09:35:31] kormat: there were a few sustained lag alerts tonight: https://bots.wmflabs.org/logs/%23wikimedia-databases/20200925.txt but on eqiad hosts, expected? I didn't have a deeper look
[09:35:53] sorry, I just saw you are afk
[10:09:54] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) Similarly, previously, dump time (not having into account s4 section) got reduced from 2020-08-18 00:00:02 to 2...
[10:23:58] I restarted the ferm systemd unit on es1025, for some reason it had failed
[10:35:14] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 12 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[10:36:30] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[10:38:31] here it goes again
[10:50:14] I would like to experiment with temporal data tables https://mariadb.com/kb/en/temporal-data-tables/
[10:50:27] not for production, but maybe on the source backup hosts
[10:51:04] having a fast way to access previous versions of a table would be a powerful tool for disaster recovery (easier than binlogs)
[10:51:50] I wonder if it would explode in performance for some tables
[11:04:22] jynus: experiment now or in general?
[11:43:34] jynus: kormat: That's our page
[11:48:01] jynus: kormat: marostegui: duplicate entry SQL lag pages
[12:10:25] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris)
[12:10:33] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) p:05Triage→03Unbreak!
[12:11:02] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team
[12:12:05] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) List of affected wikis ` apiportalwiki avkwiki cebwiki dewiki enwikivoyage jawikivoyage lldwiki mgwiktionary mhwiktionary muswiki shwiki srwiki thankyouwiki `
[12:17:39] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) DBA are testing a recovery action prior to applying it broadly.
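An aside on the temporal data tables idea raised at 10:50: these are MariaDB (10.3+) system-versioned tables, which keep old row versions inside the table and let you read it as of a past point in time, which is what would make "instant recovery of a table at an arbitrary time in the past" possible without replaying binlogs. A minimal sketch; the table name is made up for illustration, and whether the history overhead is acceptable on the backup source hosts is exactly what the experiment would have to answer:

    -- Enable system versioning on an existing table (MariaDB 10.3+):
    ALTER TABLE ipblocks_copy ADD SYSTEM VERSIONING;

    -- Read the table as it looked at an arbitrary moment in the past:
    SELECT * FROM ipblocks_copy
    FOR SYSTEM_TIME AS OF TIMESTAMP '2020-09-25 09:00:00';

    -- History rows are stored in the same table, so write-heavy tables can grow
    -- quickly; MariaDB supports PARTITION BY SYSTEM_TIME to keep that contained.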
[12:18:38] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Majavah)
[12:20:04] PROBLEM - MariaDB sustained replica lag on db2137 is CRITICAL: 1176 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[12:24:52] RECOVERY - MariaDB sustained replica lag on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[12:27:10] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) A fix was applied and users of affected wikis should be seeing recovery now.
[12:29:14] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05Unbreak!→03High This shoud be fixed now for end-users. removing unbreak now. Please report any strange things you may find (should be...
[12:50:01] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) a:03jcrespo This needs research, it is weird this happened, specially after T260042 was done prior to switchover.
[13:32:40] I asked manuel privately, and the issue is probably pre-switchover
[13:32:46] which is good news
[13:33:26] I mean, not something we would ask for, but it means that nothing broke while codfw was primary, it was probably a pre-existing issue
[13:35:33] sobanski: re: experiment now or in general? "at some point in the future"
[13:36:33] as in "new cool tech that could be useful for some areas, but nobody is familiar with it"
[13:37:02] especially instant recovery of a table at an arbitrary time in the past
[13:44:52] kormat: how's db2125 behaving? Still giving issues :-(?
[13:45:39] jynus: it got a new motherboard yesterday. i've reimaged it, and am currently restoring the db from backups. so hopefully i can have it run over the weekend and we'll see
[13:46:00] oh, so that's better than I thought, I thought hw was still misbehaving
[13:46:12] hopefully all hw woes are behind us!
[13:49:39] indeed :)
[13:52:29] my preliminary research tells me that the error was that the primary instance and the 4 other hosts lacked a row on enwikivoyage.ipblocks
[13:52:49] whether the right "state" was with the row or without it, I am not sure is easy to say
[13:53:04] but I would say by majority rule, it should be with it
[13:54:21] I may have to talk to 2 admins on enwikivoyage to understand what is the desired state of blocks
[14:13:30] so this is the weird thing - of the 2 conflicting rows, one was created at 09:05 and the other at 09:39/09:47, so unless this was caused by the inconsistency somewhere else, it may be an application issue?
[14:13:57] but it wouldn't make sense that it only applies to some servers
[14:14:06] must be some weird data dependency
[14:14:28] These times are UTC today?
[14:14:33] yes
[14:14:56] the problem is writes are complex, they can fail because of a unique key depending on another, older row
[14:15:30] I am talking now with the admins who did the block to put the data back in the desired user state
[14:15:38] so we are in a healthy place
[14:15:50] and we can restart eqiad replication
[14:39:56] jynus: ohh, transferpy. thanks for mentioning that. i've been paying attention to wmfbackups with what i'm currently doing to make sure i don't break that, but i forgot about transferpy :)
[14:40:27] I don't think transfer.py uses it
[14:40:35] but I can do a quick grep
[14:40:57] I think it uses wmfmariadb.remote maybe?
[14:41:01] even if it doesn't use this, i need to remember to keep it in mind
[14:41:05] i think so, yeah
[14:41:14] this is the downside of split repos
[14:41:24] if it was one, you could just do everything on your own
[14:41:35] on the bad side, it requires some coordination
[14:41:38] yeah plusses and minuses
[14:41:41] yep
[14:41:52] not a big minus, but something to keep in mind
[14:41:58] also this was bound to happen
[14:42:02] the api was very green
[14:42:08] not precisely very stable
[14:42:32] but it is the price to pay to not duplicate code!
[14:42:42] yeah :)
[14:44:08] Maybe we could start generating some sort of a dependency graph while it's still early
[14:44:37] i'll grab a napkin!
[14:45:14] npkn.io - the cloud service to write notes you will certainly lose before you need them
[14:45:30] :D
[14:45:46] Or maybe napk.in
[14:46:09] Well, that already exists
[14:46:21] Not as awesome as my idea though
[14:51:36] jynus: re: the db sustained lag alerts this morning, it looks like the same deal - spike of writes to s4 master, replication struggling a bit to catch up
[14:51:46] interesting
[14:52:02] in the past we had lots of lag due to bots doing huge numbers of writes
[14:52:28] but this is another manifestation - spikes happening here and there, but not for a long time
[14:52:47] as long as there is no primary dc lag, there is not an issue
[14:53:09] but I wonder how perf will handle cross-dc lag if they want active-active
[15:19:07] kormat: if around, I have a long term fix for enwikivoyage, but I would like some support
[15:19:10] in case things break
[15:19:32] I am going to drop the row on eqiad with replication
[15:19:49] then run a drop for both old and new rows on the real master
[15:20:08] and finally insert the last version of the rows, as agreed with contributors
[15:20:22] I need someone to check whether any unexpected replication alerts arise
[15:21:09] jynus: do you think this is something that needs to happen today, or could it wait until monday?
[15:21:16] it has to be today
[15:21:19] (just because it's friday evening)
[15:21:20] because if not
[15:21:29] it will happen during the weekend
[15:22:08] when there is user activity on that table again (things are inconsistent right now)
[15:22:16] ah, i see
[15:22:40] I am not doing it for pleasure :-D
[15:22:42] ok, i'm around now anyway
[15:23:53] I just logged it
[15:24:01] proceeding with eqiad change
[15:26:38] starting replication on db1100
[15:27:14] waiting a second to check there is no fallout on eqiad before proceeding with the cluster-wide deletes
[15:27:33] looks fine so far
[15:28:02] I will now do the master deletes and reinsert to go back to consistency
[15:28:45] no issues so far, right?
[15:29:06] (lag on eqiad, sure, but no breakage?
[15:29:07] )
[15:29:27] kormat: ?
[15:30:18] correct, as far as i can see
[15:30:33] ok, then doing the consistency delete + insert
[15:31:20] on db2123
[15:31:33] with replication
[15:33:03] done
[15:33:36] hopefully all nodes take it well, and we are back to a consistent state (that will get us through the weekend)
[15:33:56] no errors so far
[15:34:12] now we just have to wait for eqiad to also accept it once it catches up
[15:35:41] looking good, and there should be no issues with autoincrement
[15:37:23] (so just to clarify, the issue wasn't eqiad, which could be stopped for the weekend, but the discrepancy still present between codfw servers)
[15:37:59] everything looking good, I will be around for a while
[15:38:23] but I think we are out of danger
[15:38:37] (gotcha)
[15:38:40] cool :)
[15:39:36] and now running db-compare to verify everything is fine
[15:45:06] https://phabricator.wikimedia.org/P12796#71122 yay
[15:45:25] nice!
[15:52:08] The rare case where nothing = good
[15:55:33] he he, we have lots of tools where no output and exit code == 0 is what we want
[16:01:45] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05High→03Medium After discussing proposed fix of table inconsistency with enwikivoyage admins, an old block, that was only applied on c...
[16:20:40] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) > After discussing proposed fix of table inconsistency with enwikivoyage admins Was this public anywhere for the sake of transparency? Could...
[16:22:25] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) > Was this public anywhere for the sake of transparency? Could a log / page be linked to? Yes, it was on their Village pump. https://en.wiki...
[16:23:44] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) Thanks for the quick reply @jcrespo
[16:26:26] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) Effectively no block was applied or removed by me, only metadata was made consistent by "merging" 2 other partially applied blocks. Logs wher...
[18:52:35] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6493968, @akosiaris wrote: > List of affected wikis > > ` > apiportalwiki > avkwiki > cebwiki > dewiki > enwikivoyage > ja... 
[19:02:01] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6494987, @Marostegui wrote: >>>! In T263842#6493968, @akosiaris wrote: >> List of affected wikis >> >> ` >> apiportalwiki...
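For context, a sketch of the general shape of the 15:19-15:33 fix described above. The real statements, row values and ids are not reproduced here (the task is tracked internally because of PII), so the ipb_id values below are placeholders. The sequence: first make the lagging eqiad side consistent through its own replication, then on the real (codfw) master, db2123 per the log, with replication enabled, delete both conflicting versions of the block and re-insert the single row agreed with the enwikivoyage admins, so every replica converges on the same state; the db-compare run linked at 15:45 is what confirmed this afterwards.

    -- 1) On the eqiad side: drop the stray row so eqiad matches once replication catches up.
    DELETE FROM enwikivoyage.ipblocks WHERE ipb_id = 11111;  -- placeholder id

    -- 2) On the active (codfw) master, with replication enabled: remove both
    --    conflicting versions of the block...
    DELETE FROM enwikivoyage.ipblocks WHERE ipb_id IN (11111, 22222);  -- placeholder ids

    -- 3) ...and re-insert the agreed final version of the row (column list and values
    --    omitted here; they were set as agreed with the local admins).
    -- INSERT INTO enwikivoyage.ipblocks (ipb_id, ipb_address, ...) VALUES (...);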