[02:24:30] 10DBA, 10Wikimedia-Rdbms, 10Goal, 10Patch-For-Review, and 3 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10Krinkle) @Marostegui FYI - I assume you'd like this change as well, but let me know if not :)
[04:42:38] jynus: we might need to redo the snapshot of s8, I think it started while the MCR change is still on-going, which is kinda hard to avoid as it takes 24h :)
[04:42:43] (still running btw)
[04:42:49] on db1116:3318
[05:25:44] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) >>! In T259438#6377907, @bd808 wrote: >>>! In T259438#6375572, @Marostegui wrote: >> @bd808 @Bstorm we have moved two w...
[06:23:30] should I kill it?
[06:28:41] is m3 maintenance postponed?
[06:31:37] oh, I see it is the 18th
[06:32:00] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10jcrespo) Maybe this can be scheduled before or after the maintenance for T259589?
[06:36:59] yes, moved to tuesday
[06:57:03] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Marostegui) I don't have the bandwidth to prepare this change before Tuesday - @jcrespo if you happen to have some room to prepare this before Tuesday (or after), please t...
[07:20:17] 10DBA, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (10mmodell) The next step here is to create a `$passwords::mysql::phabricator::phd_pass` and updat...
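[Editor's note: the T221159 task at the top of the log is about replacing GTID_WAIT-based replication waits with pt-heartbeat-based lag measurement. As a rough, hypothetical illustration of the two mechanisms (the GTID value and lag query below are made up, not taken from MediaWiki):]

```sql
-- MariaDB's MASTER_GTID_WAIT() blocks until the replica has applied the
-- given GTID, or returns -1 after the timeout (1 second here):
SELECT MASTER_GTID_WAIT('171966669-171966669-123456', 1);

-- The pt-heartbeat approach instead reads a timestamp row that the primary
-- updates periodically, and derives lag from it (illustrative query against
-- the standard pt-heartbeat schema):
SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) / 1000000 AS lag_seconds
FROM heartbeat.heartbeat
ORDER BY ts DESC
LIMIT 1;
```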
[07:22:44] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) I made a note about how we go about setting a separate password for PHD daemons in a comment at T146055#6378825. Essentially we just need to define a new passwor...
[07:37:05] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1110.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:10:27] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1110.eqiad.wmnet'] ` and were **ALL** successful.
[08:47:13] 10DBA, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (10jcrespo) I've created `$passwords::mysql::phabricator::phd_pass` on the private repo, but not s...
[10:04:18] I will take care of x1
[10:04:27] but I will first change location
[10:04:40] x1?
[11:18:39] have a look https://gerrit.wikimedia.org/r/c/operations/puppet/+/619729
[11:19:30] is that what you meant with x1?
[11:21:11] no, x1 is to remove older x1 backups
[11:21:18] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui) 05Stalled→03Open Unstalling it...I think we can actually close this no? It's been working without many false positives (apart from backup sources, which is add...
[11:21:20] 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10Marostegui)
[11:21:27] jynus: ah gotcha
[11:21:37] because they have moved the server away, purging stops happening automatically
[11:21:48] in the past everything old was purged
[11:22:00] but I found out the hard way that was not a good policy
[11:22:08] so it now needs manual cleanup :-)
[11:22:28] I wonder why I didn't see the calendar notification
[11:22:55] mmm, I stealth invited you because I was going to take care of that
[11:23:08] but in case I was unavailable I wanted to optionally add you
[11:23:17] e.g. if I was on vacation or something
[11:23:27] Yeah, I accepted it, but not sure if I missed the notification or it never arrived on my phone
[11:23:28] I think I was going to tell you and I may have forgotten
[11:23:49] I skipped the notification as there was a 99% chance you wouldn't have to do anything
[11:23:57] so I didn't bother you with useless notifications
[11:24:21] and then I forget to tell you on monday, I think
[11:24:26] forgot
[11:24:30] :)
[11:28:23] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) Previous check, and some previous comment is because the prometheus-based alert doesn't work well when replication is stopped, FYI (we get unknowns). I sent the patch...
[11:29:52] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) > It's been working without many false positives Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it...
[11:30:51] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui) I think in general, the workflow for any host that will get replication stopped is to downtime it first.
In that sense, the backup sources get excluded from this w...
[11:31:45] BTW, marostegui did you remove the x1 files yourself?
[11:31:49] jynus: nop
[11:31:58] because I don't remember doing it, but they are not there
[11:32:13] oh, sorry, wrong server
[11:34:00] and I just deleted the wrong ones
[11:34:09] :-)
[11:34:11] XD
[11:34:26] I am glad we have backups!
[11:34:31] exactly
[11:35:22] and redundancy
[11:36:23] https://jynus.com/gif/high_five.gifv
[11:36:35] I will schedule the deletion of the dumps in a month
[11:36:50] will send a proper invite, will be less confusing overall
[11:37:19] we have redundancy of redundancy of redundancy, in fact
[11:38:01] multiple snapshots, if that doesn't work, dumps, if that doesn't work, geographical redundancy, and if that doesn't work, the copies on bacula
[11:44:04] marostegui: are you using the codfw test host, does it still have s1?
[11:44:23] jynus: I updated db2102 yesterday, but I didn't touch anything else
[11:44:25] I would like to test the prometheus check there
[11:44:43] yep, go ahead
[11:45:33] actually, I cannot use that host, it doesn't have the check
[11:45:45] I will choose a random mw codfw one
[11:46:15] one that is not on s8 :-)
[11:46:24] hehehe
[11:47:07] btw I guess the warning about s8 backup size is due to the MCR change?
[11:47:14] it was deployed already in codfw
[11:48:36] marostegui: will check soon
[11:48:48] for now heads up on db2130 replication
[11:48:50] will log it
[11:50:18] ok!
[11:51:27] I did CHANGE MASTER TO MASTER_DELAY=2;
[11:51:33] not sure if that will be enough
[11:52:39] I didn't get the last part of "if you make your patch work I think we'd be good" - do you mean that it would be better if the alert was reenabled in a working state?
[11:53:32] or that disabling it is the best option (for now)?
[11:53:37] No, that was a response to your previous comment about you not being sure how to move forward with the rest of the hosts
[11:54:20] so my patch only removes the alert
[11:54:52] because making it work is quite difficult right now (we don't have pt-heartbeat on prometheus)
[11:55:07] (on some hosts of course, not on core)
[11:55:38] yep, I +1'ed it
[11:56:23] so you mean by that that you are ok with merging it?
[11:56:45] what I don't understand is what you would prefer for core hosts
[11:56:51] which the patch doesn't touch
[11:57:23] where, it is true, the issue is not as big, but it still affects them with lower impact
[11:57:42] So I prefer to leave the alert enabled for core hosts
[11:57:45] +1
[11:58:02] so "known issue, we will fix it at a later time", right?
[11:58:12] which was kinda my take on that
[11:58:26] yep, what I meant is that the current workflow for stopping replication on other hosts apart from backup sources is that we always downtime them
[11:58:29] and close the ticket, right?
[11:58:31] cool
[11:58:34] so we shouldn't be affected by that "feature"
[11:58:47] yep, +1 to merge your patch (if you believe it will work) and close the task
[11:58:49] I just didn't fully what you meant because my patch didn't really fix it
[11:59:02] *understand
[11:59:25] it fixes your problem :)
[11:59:37] he
[11:59:38] barely
[11:59:49] it just ignores it in a place where we don't care
[12:00:02] (we don't care if backup hosts are 3 seconds behind)
[12:00:08] the alter on db1116:3318 finished (backup source for s8)
[12:00:26] I will wait for stephen's input, I don't understand the implications of the mysql_role code
[12:00:36] I will check the backup size
[12:00:43] sounds good, my +1 was for the idea :)
[12:00:54] and retry s8 snapshots on eqiad
[12:00:59] based on your input
[12:01:18] db2130 worked nicely
[12:01:34] PROBLEM - 5-minute average replication lag is over 2s on db2130 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104&var-dc=codfw+prometheus/ops
[12:02:09] went warning much earlier
[12:02:26] reset the delay to 0
[12:02:30] should go away
[12:02:33] nice
[12:04:21] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) Did: ` STOP SLAVE; CHANGE MASTER TO MASTER_DELAY=2; START SLAVE; ` And the alert happened nicely. The only edge case, other than the stop slaves on dbstores, is a p...
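[Editor's note: the alert test discussed above can be summarized as the following SQL sequence, a sketch reconstructed from the commands quoted in the log (run on a codfw replica such as db2130; MariaDB syntax):]

```sql
-- Introduce 2 seconds of artificial replication delay so the
-- prometheus lag check fires (first WARNING, then CRITICAL):
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=2;
START SLAVE;

-- ... wait for the alert to trigger and be verified ...

-- Revert once the alert has been observed, so the host recovers:
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=0;
START SLAVE;
```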
[12:05:28] RECOVERY - 5-minute average replication lag is over 2s on db2130 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104&var-dc=codfw+prometheus/ops
[12:07:16] good work kormat on the alert, not sure if I said that
[12:07:40] oh, thanks :)
[12:08:09] this may have been my second favourite patch of yours so far
[12:09:13] first one being the partman ones
[12:09:43] the funny thing is - i've pretty much completely blocked the partman stuff out of my mind. you mentioned it again a week or two ago and i was all "huuuh. _that_ thing"
[12:09:54] he
[12:10:07] now you understand when you ask me about wmfmariadbpy scripts
[12:10:10] :-)
[12:10:29] yep :)
[12:10:39] I did whaaaaat?
[12:11:16] 630 wikis out of 900 are done so far, 600 drifts :D
[12:11:23] XDDDD
[12:12:12] https://www.irccloud.com/pastebin/L2Ui6Lpe/
[12:13:04] that'll be fun
[12:22:51] I need to increase my local console buffer, I try to run screen -ls on cumin1001, but manuel's sessions keep not fitting :-P
[12:23:27] you still have one there too!
[12:23:57] :-;
[12:24:01] ;-)
[12:32:36] jynus: https://phabricator.wikimedia.org/T238966#6375188
[12:32:45] matches your paste
[12:33:16] do you know if the _temp tables will disappear eventually too?
[12:33:28] yep, they will
[12:33:40] not sure if before the end of 2020
[13:48:07] i'll do a release with the fixed switchover.py
[14:04:44] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) 05Open→03Resolved Fix is merged, and a fresh debian package has been released, and installed on both cumin hosts.
[14:05:36] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) Thanks for addressing this so fast!
[21:50:14] 10DBA, 10Wikimedia-Site-requests: Ensure dblist shard files match db-*.php definitions - https://phabricator.wikimedia.org/T260297 (10Urbanecm)
[22:28:14] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10bd808) >>! In T259438#6378639, @Marostegui wrote: > I have ran the following on all labsdb hosts, but not sure whether this worked...
[22:56:27] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10bd808) >>! In T259438#6378639, @Marostegui wrote: > I have sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/619627 is thi...