[04:55:40] Amir1: what you can do is rename them first and then drop
[05:29:11] taavi: Can you retry in clouddb1016 and let me know if it works, no?
[05:38:16] <_joe_> Amir1: listen to marostegui, never EVER delete anything from a DB with such a command before you're 300% sure it works
[05:38:24] <_joe_> first rename, once that worked, drop
[06:02:31] I am switching the es7 primary
[06:35:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@x3.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@x3.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:45] jynus: I want to start migrating sX or mX to 10.11, but I want to sync up with you regarding the backups there. Is there any section I could start with (I don't mind either sX or mX)?
[07:41:08] can I start kernel upgrades on s8 in eqiad?
[07:45:19] +1
[07:49:29] ok, thanks
[07:51:19] marostegui: just start with any, I will handle anything
[07:53:09] although we should test recovery to avoid a surprise like last time
[07:56:13] FYI: Last snapshot for s8 at eqiad (db1171) taken on 2025-06-04 01:52:23 is 1216 GiB, but the previous one was 1587 GiB, a change of -23.3 %
[07:56:27] Last snapshot for x3 at codfw (db2200) taken on 2025-06-04 07:00:30 is 334 GiB, but the previous one was 1351 GiB, a change of -75.3 %
[07:57:14] jynus: Yeah, do you want me to hold until you test the recovery?
[07:58:02] no, the first one can go at the same time - we cannot test well if it is not migrated
[07:58:42] but I wonder if you have any broken or spare host to test, e.g. a host you need to rebuild
[07:59:27] jynus: We don't have any at the moment (db1246 is still waiting for replacement) but we can just recover any for training purposes
[08:00:13] jynus: Maybe I should start with one sX codfw because it doesn't have wiki replicas. Because if I start with mX we have to upgrade the backup source, which is common to all mX sections
[08:00:17] So I say let's start with an easy section
[08:00:46] first core hosts, last backups (we make a pause here for testing) and then progress after that
[08:01:20] jynus: sounds good, I will start with s6 codfw
[08:01:28] ok by me
[08:01:47] just expect some wait after completion of the first
[08:01:59] then the rest should be able to be done with no waits
[08:02:00] of course
[08:02:44] thanks to you for communicating so well
[08:11:59] this allows me to reorganize the backups before any upgrade work on my side
[08:23:11] marostegui: could s2 go after s6? I can do it in any order, but that would mean 0 changes on the current backup config
[08:24:19] jynus: Yeah, no problem at all
[08:31:30] marostegui: the permissions still seem to be off. at least the `labsdbadmin` account is missing the ability to grant the `labsdbadmin` role
[08:31:44] taavi: Ah, I only worked on labsdbuser
[08:31:46] Let me check that one
[08:33:54] taavi: granted, recheck please
[08:34:37] one second
[08:41:27] the grants seem fine now, thanks!
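(For reference, a hedged sketch of the `labsdbadmin` role grant discussed above, assuming standard MariaDB role syntax; the host patterns and the example user name are placeholders, not the actual production grants on the clouddb hosts.)

```sql
-- Sketch only: '%' host patterns and 'example_user' are placeholders.

-- Let the labsdbadmin account hand the labsdbadmin role out to other
-- accounts (the "ability to grant the role" reported missing above):
GRANT labsdbadmin TO 'labsdbadmin'@'%' WITH ADMIN OPTION;

-- With that in place, new accounts can be backfilled with the role:
GRANT labsdbadmin TO 'example_user'@'%';

-- And the result verified with:
SHOW GRANTS FOR 'labsdbadmin'@'%';
```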
the maintain-dbusers script to create new users still doesn't see that it needs to backfill all those users, but that's on us to fix
[08:42:17] ok, I am waiting for now then
[08:42:24] have you tested both clouddb1016 and clouddb1020?
[08:43:06] 1016 only. good point about checking 1020 as well, doing that now
[08:43:11] thanks
[08:50:11] the grants seem ok on 1020
[08:50:25] excellent
[09:27:22] federico3: go for it!
[09:27:33] (s8 reboots)
[09:27:56] Guys please be aware I am renaming the datadir on pc1: https://phabricator.wikimedia.org/T395983
[13:38:34] Amir1, marostegui: did some updates to https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3
[14:44:56] the backup source s2 upgrade is sort of ready: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153641
[15:01:30] Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153648 please? It's to load 2 new backends (and drain 2 old ones to free up rack space)
[15:02:12] Emperor: how can I verify it?
[15:02:42] I was checking it FYI
[15:02:58] ok jynus, want to do it?
[15:02:58] federico3: the commit message has a link to the Swift ring management docs, which describe the syntax of the yaml file.
[15:03:43] https://phabricator.wikimedia.org/T393104#10883418 notes the reasoning for this being only 2 of the 4 new nodes
[15:03:59] yep, you mentioned it at the meeting
[15:04:11] I may be able to remove 2 hosts Soon(TM)
[15:04:40] sadly, lately my workload has been handled as a stack, not a queue
[15:07:29] 's OK, if nothing else comes along quicker, this CR will drain a couple of nodes that will free up space for the last 2 new ones
[15:12:20] federico3: one thing I can suggest is that on the next patch, especially if it is a bit more interesting, you could shadow me and I go over the full list of things I think about/check/do
[15:12:41] obviously how I do it is not how everybody does it
[15:15:46] ok thanks!
[18:05:48] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:05:48] FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
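(For reference, a hedged SQL sketch of the "rename first, then drop" pattern Amir1 and _joe_ recommend near the top of the log; the database and table names are invented for illustration.)

```sql
-- Sketch only: enwiki.some_old_table is an illustrative name.

-- Step 1: rename the table out of the way. Anything still using it fails
-- loudly now, but the data stays intact and the rename is easy to reverse.
RENAME TABLE enwiki.some_old_table TO enwiki.some_old_table_to_drop;

-- Step 2: only after a quiet soak period with no complaints, drop it for real.
DROP TABLE enwiki.some_old_table_to_drop;
```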