[02:24:30] 10DBA, 10Wikimedia-Rdbms, 10Goal, 10Patch-For-Review, and 3 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10Krinkle) @Marostegui FYI - I assume you'd like this change as well, but let me know if not :)
[04:42:38] jynus: we might need to redo the snapshot of s8, I think it started while the MCR change is still on-going, which is kinda hard to avoid as it takes 24h :)
[04:42:43] (still running btw)
[04:42:49] on db1116:3318
[05:25:44] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) >>! In T259438#6377907, @bd808 wrote: >>>! In T259438#6375572, @Marostegui wrote: >> @bd808 @Bstorm we have moved two w...
[06:23:30] should I kill it?
[06:28:41] is m3 maintenance postponed?
[06:31:37] oh, I see it is the 18th
[06:32:00] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10jcrespo) Maybe this can be scheduled before or after the maintenance for T259589?
[06:36:59] yes, moved to tuesday
[06:57:03] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10Marostegui) I don't have the bandwidth to prepare this change before Tuesday - @jcrespo if you happen to have some room to prepare this before Tuesday (or after), please t...
[07:20:17] 10DBA, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (10mmodell) The next step here is to create a `$passwords::mysql::phabricator::phd_pass` and updat...
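[Editor's note: the T221159 task at the top of the log is about replacing GTID_WAIT-based replication waits with pt-heartbeat-based lag measurement. As a rough, hypothetical illustration of the two mechanisms (the GTID value and lag query below are made up, not taken from MediaWiki):]

```sql
-- MariaDB's MASTER_GTID_WAIT() blocks until the replica has applied the
-- given GTID, or returns -1 after the timeout (1 second here):
SELECT MASTER_GTID_WAIT('171966669-171966669-123456', 1);

-- The pt-heartbeat approach instead reads a timestamp row that the primary
-- updates periodically, and derives lag from it (illustrative query against
-- the standard pt-heartbeat schema):
SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) / 1000000 AS lag_seconds
FROM heartbeat.heartbeat
ORDER BY ts DESC
LIMIT 1;
```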
[07:22:44] 10DBA, 10Operations, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) I made a note about how we go about setting a separate password for PHD daemons in a comment at T146055#6378825. Essentially we just need to define a new passwor...
[07:37:05] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1110.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:10:27] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1110.eqiad.wmnet'] ` and were **ALL** successful.
[08:47:13] 10DBA, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 3 others: Improve privilege separation for phabricator's config files and mysql credentials - https://phabricator.wikimedia.org/T146055 (10jcrespo) I've created `$passwords::mysql::phabricator::phd_pass` on the private repo, but not s...
[10:04:18] I will take care of x1
[10:04:27] but I will first change location
[10:04:40] x1?
[11:18:39] have a look https://gerrit.wikimedia.org/r/c/operations/puppet/+/619729
[11:19:30] is that what you meant with x1?
[11:21:11] no, x1 is to remove older x1 backups
[11:21:18] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui) 05Stalled→03Open Unstalling it...I think we can actually close this no? It's been working without many false positives (apart from backup sources, which is add...
[11:21:20] 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10Marostegui)
[11:21:27] jynus: ah gotcha
[11:21:37] because they have moved the server away, purging stops happening automatically
[11:21:48] in the past everything old was purged
[11:22:00] but I found out the hard way that was not a good policy
[11:22:08] so it now needs manual cleanup :-)
[11:22:28] I wonder why I didn't see the calendar notification
[11:22:55] mmm, I stealth invited you because I was going to take care of that
[11:23:08] but in case I was unavailable I wanted to optionally add you
[11:23:17] e.g. if I was on vacation or something
[11:23:27] Yeah, I accepted it, but not sure if I missed the notification or it never arrived on my phone
[11:23:28] I think I was going to tell you and I may have forgotten
[11:23:49] I skipped the notification as there was a 99% chance you wouldn't have to do anything
[11:23:57] so I didn't bother you with useless notifications
[11:24:21] and then I forget to tell you on monday, I think
[11:24:26] forgot
[11:24:30] :)
[11:28:23] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) Previous check, and some previous comment is because the prometheus-based alert doesn't work well when replication is stopped, FYI (we get unknowns). I sent the patch...
[11:29:52] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) > It's been working without many false positives Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it...
[11:30:51] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10Marostegui) I think in general, the workflow for any host that will get replication stopped is to downtime it first.
In that sense, the backup sources get excluded from this w...
[11:31:45] BTW, marostegui did you remove the x1 files yourself?
[11:31:49] jynus: nop
[11:31:58] because I don't remember doing it, but they are not there
[11:32:13] oh, sorry, wrong server
[11:34:00] and I just deleted the wrong ones
[11:34:09] :-)
[11:34:11] XD
[11:34:26] I am glad we have backups!
[11:34:31] exactly
[11:35:22] and redundancy
[11:36:23] https://jynus.com/gif/high_five.gifv
[11:36:35] I will schedule the deletion of the dumps in a month
[11:36:50] will send a proper invite, will be less confusing overall
[11:37:19] we have redundancy of redundancy of redundancy, in fact
[11:38:01] multiple snapshots, if that doesn't work, dumps, if that doesn't work, geographical redundancy, and if that doesn't work, the copies on bacula
[11:44:04] marostegui: are you using the codfw test host, does it still have s1?
[11:44:23] jynus: I updated db2102 yesterday, but I didn't touch anything else
[11:44:25] I would like to test the prometheus check there
[11:44:43] yep, go ahead
[11:45:33] actually, I cannot use that host, it doesn't have the check
[11:45:45] I will choose a random mw codfw one
[11:46:15] one that is not on s8 :-)
[11:46:24] hehehe
[11:47:07] btw I guess the warning about s8 backup size is due to the MCR change?
[11:47:14] it was deployed already in codfw
[11:48:36] marostegui: will check soon
[11:48:48] for now heads up on db2130 replication
[11:48:50] will log it
[11:50:18] ok!
[11:51:27] I did CHANGE MASTER TO MASTER_DELAY=2;
[11:51:33] not sure if that will be enough
[11:52:39] I didn't get the last part of "if you make your patch work I think we'd be good" - do you mean that it would be better if the alert was reenabled in a working state?
[11:53:32] or that disabling it is the best option (for now)?
[11:53:37] No, that was a response to your previous comment about you not being sure how to move forward with the rest of the hosts
[11:54:20] so my patch only removes the alert
[11:54:52] because making it work is quite difficult right now (we don't have pt-heartbeat on prometheus)
[11:55:07] (on some hosts of course, not on core)
[11:55:38] yep, I +1'ed it
[11:56:23] so you mean by that that you are ok with merging it?
[11:56:45] what I don't understand is what you would prefer for core hosts
[11:56:51] which the patch doesn't touch
[11:57:23] where, it is true, the issue is not as big, but it still affects them with lower impact
[11:57:42] So I prefer to leave the alert enabled for core hosts
[11:57:45] +1
[11:58:02] so "known issue, we will fix it at a later time", right?
[11:58:12] which was kinda my take on that
[11:58:26] yep, what I meant is that the current workflow for stopping replication on other hosts apart from backup sources is that we always downtime them
[11:58:29] and close the ticket, right?
[11:58:31] cool
[11:58:34] so we shouldn't be affected by that "feature"
[11:58:47] yep, +1 to merge your patch (if you believe it will work) and close the task
[11:58:49] I just didn't fully what you meant because my patch didn't really fix it
[11:59:02] *understand
[11:59:25] it fixes your problem :)
[11:59:37] he
[11:59:38] barely
[11:59:49] it just ignores it in a place where we don't care
[12:00:02] (we don't care if backup hosts are 3 seconds behind)
[12:00:08] the alter on db1116:3318 finished (backup source for s8)
[12:00:26] I will wait for stephen's input, I don't understand the implications of the mysql_role code
[12:00:36] I will check the backup size
[12:00:43] sounds good, my +1 was for the idea :)
[12:00:54] and retry s8 snapshots on eqiad
[12:00:59] based on your input
[12:01:18] db2130 worked nicely
[12:01:34] PROBLEM - 5-minute average replication lag is over 2s on db2130 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104&var-dc=codfw+prometheus/ops
[12:02:09] went warning much earlier
[12:02:26] reset the delay to 0
[12:02:30] should go away
[12:02:33] nice
[12:04:21] 10DBA, 10Patch-For-Review, 10User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (10jcrespo) Did: ` STOP SLAVE; CHANGE MASTER TO MASTER_DELAY=2; START SLAVE; ` And the alert happened nicely. The only edge case, other than the stop slaves on dbstores, is a p...
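[Editor's note: the alert test discussed above can be summarized as the following SQL sequence, a sketch reconstructed from the commands quoted in the log (run on a codfw replica such as db2130; MariaDB syntax):]

```sql
-- Introduce 2 seconds of artificial replication delay so the
-- prometheus lag check fires (first WARNING, then CRITICAL):
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=2;
START SLAVE;

-- ... wait for the alert to trigger and be verified ...

-- Revert once the alert has been observed, so the host recovers:
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=0;
START SLAVE;
```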
[12:05:28] RECOVERY - 5-minute average replication lag is over 2s on db2130 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2130&var-port=9104&var-dc=codfw+prometheus/ops
[12:07:16] good work kormat on the alert, not sure if I said that
[12:07:40] oh, thanks :)
[12:08:09] this may have been my second favourite patch of yours so far
[12:09:13] first one being the partman ones
[12:09:43] the funny thing is - i've pretty much completely blocked the partman stuff out of my mind. you mentioned it again a week or two ago and i was all "huuuh. _that_ thing"
[12:09:54] he
[12:10:07] now you understand when you ask me about wmfmariadbpy scripts
[12:10:10] :-)
[12:10:29] yep :)
[12:10:39] I did whaaaaat?
[12:11:16] 630 wikis out of 900 are done so far, 600 drifts :D
[12:11:23] XDDDD
[12:12:12] https://www.irccloud.com/pastebin/L2Ui6Lpe/
[12:13:04] that'll be fun
[12:22:51] I need to increase my local console buffer, I try to run screen -ls on cumin1001, but manuel's sessions keep not fitting :-P
[12:23:27] you still have one there too!
[12:23:57] :-;
[12:24:01] ;-)
[12:32:36] jynus: https://phabricator.wikimedia.org/T238966#6375188
[12:32:45] matches your paste
[12:33:16] do you know if the _temp tables will disappear eventually too?
[12:33:28] yep, they will
[12:33:40] not sure if before the end of 2020
[13:48:07] i'll do a release with the fixed switchover.py
[14:04:44] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Kormat) 05Open→03Resolved Fix is merged, and a fresh debian package has been released, and installed on both cumin hosts.
[14:05:36] 10DBA, 10Operations, 10Patch-For-Review, 10User-Kormat: switchover.py breaks on 10.4 master - https://phabricator.wikimedia.org/T260127 (10Marostegui) Thanks for addressing this so fast!
[21:50:14] 10DBA, 10Wikimedia-Site-requests: Ensure dblist shard files match db-*.php definitions - https://phabricator.wikimedia.org/T260297 (10Urbanecm)
[22:28:14] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10bd808) >>! In T259438#6378639, @Marostegui wrote: > I have ran the following on all labsdb hosts, but not sure whether this worked...
[22:56:27] 10DBA, 10Patch-For-Review, 10User-Urbanecm, 10cloud-services-team (Kanban): Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10bd808) >>! In T259438#6378639, @Marostegui wrote: > I have sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/619627 is thi...