[01:58:03] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Hi @jcrespo - the host is racked, and the ETA for completion by @Cmjohnson and @RobH is next Wednesday...
[02:00:32] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Confirmed with @Cmjohnson and @RobH today, that these es1026-1034 hosts will be ready for you by end of Octo...
[05:14:00] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 58.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[05:20:26] PROBLEM - MariaDB sustained replica lag on db1143 is CRITICAL: 9.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[05:22:02] RECOVERY - MariaDB sustained replica lag on db1143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104
[05:23:40] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[08:24:20] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:46:11] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:53:35] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) After setup of the new hardware, and deployment of the new backup scheduler, a full backup run took from 2020-0...
[08:59:36] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) CC @LSobanski ^ 8.1 terabytes were backed up from live databases, prepared, gathered metadata for each file, t...
[09:04:48] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[09:31:07] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] ` and were **ALL** successful.
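A quick aside on the "MariaDB sustained replica lag" alerts above: "58.2 ge 2" means the measured lag (58.2 s) is greater than or equal to the 2 s critical threshold, and the "(C)2 ge (W)1" in the recoveries are the critical/warning thresholds. For reference only, a minimal way to eyeball lag by hand on a replica, assuming plain MariaDB replication; this is not the production check, which is documented at the wikitech link in the alerts:

    -- On the replica, from the mysql client; just a sanity peek, not the alert logic.
    SHOW SLAVE STATUS\G
    -- Relevant fields: Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master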
[09:35:31] kormat: there were a few sustained lag alerts tonight: https://bots.wmflabs.org/logs/%23wikimedia-databases/20200925.txt but on eqiad hosts, expected? I didn't have a deeper look
[09:35:53] sorry, I just saw you are afk
[10:09:54] 10DBA, 10Goal, 10Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (10jcrespo) Similarly, previously, dump time (not having into account s4 section) got reduced from 2020-08-18 00:00:02 to 2...
[10:23:58] I restarted the ferm systemd unit on es1025, for some reason it had failed
[10:35:14] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 12 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[10:36:30] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[10:38:31] here it goes again
[10:50:14] I would like to experiment with temporal data tables https://mariadb.com/kb/en/temporal-data-tables/
[10:50:27] not for production, but maybe on the source backup hosts
[10:51:04] having a fast way to access previous versions of a table would be a powerful tool for disaster recovery (easier than binlogs)
[10:51:50] I wonder if it would explode in performance for some tables
[11:04:22] jynus: experiment now or in general?
[11:43:34] jynus: kormat: That's our page
[11:48:01] jynus: kormat: marostegui: duplicate entry SQL lag pages
[12:10:25] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris)
[12:10:33] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) p:05Triage→03Unbreak!
[12:11:02] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team
[12:12:05] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) List of affected wikis ` apiportalwiki avkwiki cebwiki dewiki enwikivoyage jawikivoyage lldwiki mgwiktionary mhwiktionary muswiki shwiki srwiki thankyouwiki `
[12:17:39] 10DBA, 10Operations: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) DBA are testing a recovery action prior to applying it broadly.
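An aside on the temporal data tables idea raised at 10:50: these are MariaDB (10.3+) system-versioned tables, which keep old row versions inside the table and let you read it as of a past point in time, which is what would make "instant recovery of a table at an arbitrary time in the past" possible without replaying binlogs. A minimal sketch; the table name is made up for illustration, and whether the history overhead is acceptable on the backup source hosts is exactly what the experiment would have to answer:

    -- Enable system versioning on an existing table (MariaDB 10.3+):
    ALTER TABLE ipblocks_copy ADD SYSTEM VERSIONING;

    -- Read the table as it looked at an arbitrary moment in the past:
    SELECT * FROM ipblocks_copy
    FOR SYSTEM_TIME AS OF TIMESTAMP '2020-09-25 09:00:00';

    -- History rows are stored in the same table, so write-heavy tables can grow
    -- quickly; MariaDB supports PARTITION BY SYSTEM_TIME to keep that contained.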
[12:18:38] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Majavah)
[12:20:04] PROBLEM - MariaDB sustained replica lag on db2137 is CRITICAL: 1176 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[12:24:52] RECOVERY - MariaDB sustained replica lag on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[12:27:10] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) A fix was applied and users of affected wikis should be seeing recovery now.
[12:29:14] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05Unbreak!→03High This shoud be fixed now for end-users. removing unbreak now. Please report any strange things you may find (should be...
[12:50:01] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) a:03jcrespo This needs research, it is weird this happened, specially after T260042 was done prior to switchover.
[13:32:40] I asked manuel privately, and the issue is probably pre-switchover
[13:32:46] which is good news
[13:33:26] I mean, not something we would ask for, but it means that nothing broke while codfw was primary, it was probably a pre-existing issue
[13:35:33] sobanski: re: experiment now or in general? "at some point in the future"
[13:36:33] as in "new cool tech that could be useful for some areas, but nobody is familiar with it"
[13:37:02] especially instant recovery of a table at an arbitrary time in the past
[13:44:52] kormat: how's db2125 behaving? Still giving issues :-(?
[13:45:39] jynus: it got a new motherboard yesterday. i've reimaged it, and am currently restoring the db from backups. so hopefully i can have it run over the weekend and we'll see
[13:46:00] oh, so that's better than I thought, I thought hw was still misbehaving
[13:46:12] hopefully all hw woes are behind us!
[13:49:39] indeed :)
[13:52:29] my preliminary research tells me that the error was that the primary instance and the 4 other hosts lacked a row on enwikivoyage.ipblocks
[13:52:49] whether the right "state" was with the row or without it, I am not sure is easy to say
[13:53:04] but I would say by majority rule, it should be with it
[13:54:21] I may have to talk to 2 admins on enwikivoyage to understand what is the desired state of blocks
[14:13:30] so this is the weird thing - of the 2 conflicting rows, one was created at 09:05 and the other at 09:39/09:47, so unless this was caused by the inconsistency somewhere else, it may be an application issue?
[14:13:57] but it wouldn't make sense that it only applies to some servers
[14:14:06] must be some weird data dependency
[14:14:28] These times are UTC today?
[14:14:33] yes
[14:14:56] the problem is writes are complex, they can fail because of a unique key depending on another, older row
[14:15:30] I am talking now with the admins who did the block to put the data back in the desired user state
[14:15:38] so we are in a healthy place
[14:15:50] and we can restart eqiad replication
[14:39:56] jynus: ohh, transferpy. thanks for mentioning that. i've been paying attention to wmfbackups with what i'm currently doing to make sure i don't break that, but i forgot about transferpy :)
[14:40:27] I don't think transfer.py uses it
[14:40:35] but I can do a quick grep
[14:40:57] I think it uses wmfmariadb.remote maybe?
[14:41:01] even if it doesn't use this, i need to remember to keep it in mind
[14:41:05] i think so, yeah
[14:41:14] this is the downside of split repos
[14:41:24] if it was one, you could just do everything on your own
[14:41:35] on the bad side, it requires some coordination
[14:41:38] yeah plusses and minuses
[14:41:41] yep
[14:41:52] not a big minus, but something to keep in mind
[14:41:58] also this was bound to happen
[14:42:02] the api was very green
[14:42:08] not precisely very stable
[14:42:32] but it is the price to pay to not duplicate code!
[14:42:42] yeah :)
[14:44:08] Maybe we could start generating some sort of a dependency graph while it's still early
[14:44:37] i'll grab a napkin!
[14:45:14] npkn.io - the cloud service to write notes you will certainly lose before you need them
[14:45:30] :D
[14:45:46] Or maybe napk.in
[14:46:09] Well, that already exists
[14:46:21] Not as awesome as my idea though
[14:51:36] jynus: re: the db sustained lag alerts this morning, it looks like the same deal - spike of writes to s4 master, replication struggling a bit to catch up
[14:51:46] interesting
[14:52:02] in the past we had lots of lag due to bots doing huge numbers of writes
[14:52:28] but this is another manifestation - spikes happening here and there, but not for a long time
[14:52:47] as long as there is no primary dc lag, there is not an issue
[14:53:09] but I wonder how perf will handle cross-dc lag if they want active-active
[15:19:07] kormat: if around, I have a long term fix for enwikivoyage, but I would like some support
[15:19:10] in case things break
[15:19:32] I am going to drop the row on eqiad with replication
[15:19:49] then run a drop for both old and new rows on the real master
[15:20:08] and finally insert the last version of the rows, as agreed with contributors
[15:20:22] I need someone to check whether any unexpected replication alerts arise
[15:21:09] jynus: do you think this is something that needs to happen today, or could it wait until monday?
[15:21:16] it has to be today
[15:21:19] (just because it's friday evening)
[15:21:20] because if not
[15:21:29] it will happen during the weekend
[15:22:08] when there is user activity on that table again (things are inconsistent right now)
[15:22:16] ah, i see
[15:22:40] I am not doing it for pleasure :-D
[15:22:42] ok, i'm around now anyway
[15:23:53] I just logged it
[15:24:01] proceeding with eqiad change
[15:26:38] starting replication on db1100
[15:27:14] waiting a second to check there is no fallout on eqiad before proceeding with the cluster-wide deletes
[15:27:33] looks fine so far
[15:28:02] I will now do the master deletes and reinsert to go back to consistency
[15:28:45] no issues so far, right?
[15:29:06] (lag on eqiad, sure, but no breakage?
[15:29:07] )
[15:29:27] kormat: ?
[15:30:18] correct, as far as i can see
[15:30:33] ok, then doing the consistency delete + insert
[15:31:20] on db2123
[15:31:33] with replication
[15:33:03] done
[15:33:36] hopefully all nodes take it well, and we are back to a consistent state (that will get us through the weekend)
[15:33:56] no errors so far
[15:34:12] now we just have to wait for eqiad to also accept it once it catches up
[15:35:41] looking good, and there should be no issues with autoincrement
[15:37:23] (so just to clarify, the issue wasn't eqiad, which could be stopped for the weekend, but the discrepancy still present between codfw servers)
[15:37:59] everything looking good, I will be around for a while
[15:38:23] but I think we are out of danger
[15:38:37] (gotcha)
[15:38:40] cool :)
[15:39:36] and now running db-compare to verify everything is fine
[15:45:06] https://phabricator.wikimedia.org/P12796#71122 yay
[15:45:25] nice!
[15:52:08] The rare case where nothing = good
[15:55:33] he he, we have lots of tools where no output and exit code == 0 is what we want
[16:01:45] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05High→03Medium After discussing proposed fix of table inconsistency with enwikivoyage admins, an old block, that was only applied on c...
[16:20:40] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) > After discussing proposed fix of table inconsistency with enwikivoyage admins Was this public anywhere for the sake of transparency? Could...
[16:22:25] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) > Was this public anywhere for the sake of transparency? Could a log / page be linked to? Yes, it was on their Village pump. https://en.wiki...
[16:23:44] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) Thanks for the quick reply @jcrespo
[16:26:26] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) Effectively no block was applied or removed by me, only metadata was made consistent by "merging" 2 other partially applied blocks. Logs wher...
[18:52:35] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6493968, @akosiaris wrote: > List of affected wikis > > ` > apiportalwiki > avkwiki > cebwiki > dewiki > enwikivoyage > ja... 
[19:02:01] 10DBA, 10Operations, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6494987, @Marostegui wrote: >>>! In T263842#6493968, @akosiaris wrote: >> List of affected wikis >> >> ` >> apiportalwiki...
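For context, a sketch of the general shape of the 15:19-15:33 fix described above. The real statements, row values and ids are not reproduced here (the task is tracked internally because of PII), so the ipb_id values below are placeholders. The sequence: first make the lagging eqiad side consistent through its own replication, then on the real (codfw) master, db2123 per the log, with replication enabled, delete both conflicting versions of the block and re-insert the single row agreed with the enwikivoyage admins, so every replica converges on the same state; the db-compare run linked at 15:45 is what confirmed this afterwards.

    -- 1) On the eqiad side: drop the stray row so eqiad matches once replication catches up.
    DELETE FROM enwikivoyage.ipblocks WHERE ipb_id = 11111;  -- placeholder id

    -- 2) On the active (codfw) master, with replication enabled: remove both
    --    conflicting versions of the block...
    DELETE FROM enwikivoyage.ipblocks WHERE ipb_id IN (11111, 22222);  -- placeholder ids

    -- 3) ...and re-insert the agreed final version of the row (column list and values
    --    omitted here; they were set as agreed with the local admins).
    -- INSERT INTO enwikivoyage.ipblocks (ipb_id, ipb_address, ...) VALUES (...);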