[04:55:40] <marostegui>	 Amir1: what you can do is rename them first and then drop 
[05:29:11] <marostegui>	 taavi: Can you retry in clouddb1016 and let me know if it works no? 
[05:38:16] <_joe_>	 Amir1: listen to marostegui, never EVER delete anything from a DB with such a command before you're 300% sure it works
[05:38:24] <_joe_>	 first rename, once that worked, drop
[06:02:31] <marostegui>	 I am switching es7 primary
[06:35:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@x3.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@x3.service on db1154:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:45] <marostegui>	 jynus: I want to start migrating sX or mX to 10.11, but I want to sync up with you regarding the backups there, is there any section I could start (i don't mind either sX or mX) with?
[07:41:08] <federico3>	 can I start kernel upgrades on s8 in eqiad?
[07:45:19] <marostegui>	 +1
[07:49:29] <federico3>	 ok, thanks
[07:51:19] <jynus>	 marostegui: just start with any, I will handle anything
[07:53:09] <jynus>	 although we should test recovery to avoid a surprise like last time
[07:56:13] <jynus>	 FYI: Last snapshot for s8 at eqiad (db1171) taken on 2025-06-04 01:52:23 is 1216 GiB, but the previous one was 1587 GiB, a change of -23.3 %
[07:56:27] <jynus>	 Last snapshot for x3 at codfw (db2200) taken on 2025-06-04 07:00:30 is 334 GiB, but the previous one was 1351 GiB, a change of -75.3 %
[07:57:14] <marostegui>	 jynus: Yeah, do you want me to hold until you test the recovery?
[07:58:02] <jynus>	 no, the first one can go at the same time- we cannot test well if it is not migrated
[07:58:42] <jynus>	 but I wonder if you have any broken or spare host to test, e.g. a host you need to rebuild
[07:59:27] <marostegui>	 jynus: We don't have any at the moment (db1246 is still waiting for replacement) but we can just recover any for training purposes 
[08:00:13] <marostegui>	 jynus: Maybe I should start with one sX codfw because it doesn't have wiki replicas. Cause if I start with mX we have to upgrade the backpsource which is common to all mX sections 
[08:00:17] <jynus>	 So I say let's start with an easy section
[08:00:46] <jynus>	 first core hosts, last backups (we make a pause here for testing) and then progress after that
[08:01:20] <marostegui>	 jynus: sounds good, I will start with s6 codfw
[08:01:28] <jynus>	 ok to me
[08:01:47] <jynus>	 just expect some wait after completion of the first
[08:01:59] <jynus>	 then the rest should be able to be done with no waits
[08:02:00] <marostegui>	 of course
[08:02:44] <jynus>	 thanks to you for communicating so well
[08:11:59] <jynus>	 this allows me to reorganize the backups before any upgrade work on my side
[08:23:11] <jynus>	 marostegui: could s2 go after s6? I can do it in any order, but that would mean 0 changes on the current backup config
[08:24:19] <marostegui>	 jynus: Yeah, no problem at all
[08:31:30] <taavi>	 marostegui: the permissions still seem to be off. at least the `labsdbadmin` account is missing the ability to grant the `labsdbadmin` role
[08:31:44] <marostegui>	 taavi: Ah, I only worked on labsdbuser
[08:31:46] <marostegui>	 Let me check that one
[08:33:54] <marostegui>	 taavi: granted, recheck please
[08:34:37] <taavi>	 one second
[08:41:27] <taavi>	 the grants seem fine now, thanks! the maintain-dbusers script to create new users still doesn't see that it needs to backfill all those users, but that's on us to fix
[08:42:17] <marostegui>	 ok, I am waiting for now then
[08:42:24] <marostegui>	 have you tested both clouddb1016 and clouddb1020?
[08:43:06] <taavi>	 1016 only. good point about checking 1020 as well, doing that now
[08:43:11] <marostegui>	 thanks
[08:50:11] <taavi>	 the grants seem ok on 1020
[08:50:25] <marostegui>	 excellent
[09:27:22] <Amir1>	 federico3: go fo it!
[09:27:33] <Amir1>	 (s8 reboots)
[09:27:56] <marostegui>	 Guys please be aware I am renaming the datadir on pc1: https://phabricator.wikimedia.org/T395983
[13:38:34] <federico3>	 Amir1, marostegui: did some updates to https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 
[14:44:56] <jynus>	 the backup source s2 upgrade is sort of ready: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153641
[15:01:30] <Emperor>	 Could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153648 please? it's to load 2 new backends (and drain 2 old ones to free up rack space)
[15:02:12] <federico3>	 Emperor: how can I verify it?
[15:02:42] <jynus>	 I was checking it FYI
[15:02:58] <federico3>	 ok jynus want to do it ?
[15:02:58] <Emperor>	 federico3: the commit message has a link to the Swift ring management docs which describe the syntax of the yaml file.
[15:03:43] <Emperor>	 https://phabricator.wikimedia.org/T393104#10883418 notes the reasoning for this being only 2 of the 4 new nodes
[15:03:59] <jynus>	 yep, you commented it at the meeting
[15:04:11] <jynus>	 I may be able to remove 2 hosts Soon(TM)
[15:04:40] <jynus>	 sadly, lately my workload has been handled as a stack, not a queue
[15:07:29] <Emperor>	 's OK, if nothing else comes along quicker, this CR will drain a couple of nodes that will free up space for the last 2 new ones
[15:12:20] <jynus>	 federico3: one thing I can suggest is, on the next patch, specially if it is a bit more interesting, you could shadow me and I go over the full list of things I think about/check/do
[15:12:41] <jynus>	 obviously how I do it is not how everybody does it
[15:15:46] <federico3>	 ok thanks!
[18:05:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:05:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ms-be2066:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure