[07:14:15] morning
[08:05:06] arturo: let's merge https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/218 now? unless you're working on a patch to make that a map
[08:11:34] ok
[08:12:41] idm is not letting me auth at the moment, but +1 via IRC taavi
[08:13:45] Apply complete! Resources: 32 added, 0 changed, 0 destroyed.
[08:13:47] thanks
[08:21:44] cool
[08:42:30] oh my, morning! I just finished sieving all my emails from my last absence...
[08:42:36] 🎉
[08:46:30] o/
[09:46:37] is someone touching /srv/tofu-infra in cloudcontrol1011?
[09:50:50] seemingly not, as no-one has logged onto that host since thursday
[09:51:02] so how does something keep changing on that repo in a way that breaks puppet??
[09:51:47] i'm going to fully re-create that git clone
[09:53:24] sgtm, I wonder if it's related to the old T373815
[09:53:25] T373815: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815
[09:54:08] or maybe the parent task T374022
[09:54:09] T374022: tofu-infra: the cookbook should use a different git tree copy than the main one - https://phabricator.wikimedia.org/T374022
[09:55:14] "in case the cookbook dies, we could leave the main git tree copy in inconsistent state"
[09:55:50] the cookbook failed a few times recently because of the codfw s3 issue
[09:56:32] can we fix the cookbook to not do that?
[09:57:09] iirc it was supposed to have been fixed
[09:58:01] T374022 was fixed, and the ticket is marked as resolved
[10:02:59] mmm
[10:03:10] but the fix was only for the case in which the cookbook is operating on a MR
[10:04:52] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/219
[10:35:04] there is a power outage in my block
[10:51:12] arturo: i'm guessing we don't have any easy way to rename those in the state instead of recreating?
[11:13:44] power outage update: this seems to be a massive country level power outage. I'm sending you this message using tethering from a neighbor's phone, my ISP doesn't work
[11:13:58] taavi: very tedious, I don't recommend
[11:14:19] the change seems massive but it will be fast to apply
[11:18:14] arturo: ok. I guess I should apply that given your power/network situation?
[11:18:37] yes, I cannot be at the laptop at the moment
[11:21:25] tofu seemingly has decided that when applying that it'll destroy all the things first and create only afterwards :/
[11:22:10] yesh
[11:22:14] expected
[11:23:18] now it's creating things again
[11:23:21] all done in eqiad1
[11:23:25] Apply complete! Resources: 136 added, 0 changed, 137 destroyed.
[11:23:30] one less weirdly enough
[11:24:13] yes, I deleted one that was not required
[13:01:47] I'm gonna restart tools-db-5 (replica) and tools-db-4 (primary) for T392596
[13:01:47] T392596: [toolsdb] Upgrade from 10.6.20 to 10.6.21 - https://phabricator.wikimedia.org/T392596
[13:04:58] ack
[13:05:55] tools-db-5 restarted, proceeding with tools-db-4
[13:08:23] as expected shutdown of primary is taking a while
[13:10:05] ack
[13:10:12] let us know if you want any help
[13:10:24] so far all going as planned
[13:10:38] I'm not sure exactly what takes so long, but it happened the last time I restarted mariadb as well
[13:10:45] the last log line is "InnoDB: FTS optimize thread exiting."
[13:13:48] dhinus: just acked the page
[13:13:59] was that expected?
[13:14:10] oops sorry, I should have silenced
[13:14:28] ack, np, it's the reboot then that brings it read-only right?
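A note on the tofu apply above: taavi's question at 10:51 is about renaming the resources in the tofu state instead of destroying and recreating them. That is possible with `tofu state mv`, but it has to be done one resource address at a time, which is why arturo calls it very tedious for a change of this size. A minimal sketch, with purely hypothetical resource addresses:

    # Hypothetical addresses, for illustration only: move a resource to a new
    # address in the state so the next plan does not destroy and recreate it.
    tofu state mv \
        'module.cloudvps.openstack_networking_secgroup_v2.default' \
        'module.cloudvps.openstack_networking_secgroup_v2.secgroups["default"]'

With the ~136 resources involved here this would have meant scripting over `tofu state list` and running one move per resource, so letting tofu destroy and recreate was the simpler option.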
[13:14:32] (not that it flipped after)
[13:14:34] I will add a note to https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Minor_version_upgrade_(e.g._10.6.19_to_10.6.20)
[13:15:05] yes, right now it's completely unreachable, when the shutdown completes, it will come up as read-only until manually set to RW
[13:15:05] there's also an alert with `ToolsDB replication is broken on tools-db-5 (errno 2003)`
[13:15:13] ack
[13:15:19] thanks, just double checking :)
[13:18:07] still shutting down, we're at 10 mins, the last time it completed in about 8 mins
[13:18:31] no logs since 13:07 UTC
[13:19:01] CPU usage is pretty low but not zero for process mysqld
[13:19:56] htop shows some threads are going in and out of D state
[13:22:04] hmm... I don't like D state :/
[13:22:20] what does iotop show?
[13:22:26] what files/devices are they getting stuck on?
[13:22:28] I took a screenshot of htop
[13:22:40] you can also ssh to tools-db-4 to debug in realtime
[13:24:41] I see peaks of ~30MB/s on /dev/sdb
[13:24:59] writes
[13:25:29] yes, so disk is the bottleneck maybe
[13:25:57] but it shouldn't write so much data before shutdown
[13:26:13] does it wait for the last query to finish? maybe it's executing a big one?
[13:27:46] I don't think so, because I cannot connect to the server, so all connections were terminated
[13:28:18] (or at least I believe so, usually if there's a query running I can still connect and run "SHOW PROCESSLIST")
[13:29:04] we are now at 20 minutes since the last log line
[13:30:05] I would like to avoid using "kill -9" but I think I'll have to do it at some point, maybe after 30 minutes from the shutdown command
[13:30:42] I was trying to connect yep, and was unable to :/
[13:38:32] innodb_fast_shutdown is set to 1, which is not the slowest setting https://mariadb.com/kb/en/innodb-system-variables/#innodb_fast_shutdown
[13:38:42] this will need more investigation
[13:42:14] we're now past 30 mins, mysqld is still doing something, with ~20MB/s disk writes
[13:43:44] I'm a bit reluctant to use "kill -9", I think instead I can promote the replica to primary
[13:44:07] The replica has already been restarted and is working fine
[13:44:35] The only downside is it requires a DNS update, and some clients have probably cached the IP, but it should not be too bad
[13:48:12] ha, the failover procedure requires knowing the GTID of the primary, and I can't check it. it should be in theory the same ID I can see in the replica, but that's an extra question mark.
[13:52:12] ok I think it should be fine, I'm gonna continue with the failover, following the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
[13:56:19] dcaro: taavi: can I get a +1 for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/220/diffs
[13:56:42] 👀
[13:56:56] dhinus: syntax error, you're missing an end quote for the replica address
[13:57:05] thanks!
[13:57:27] side note: why's that not a CNAME? just having the hostname there would be a bit nicer to handle
[13:58:01] hmm that's true, I'll fix that later!
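A note on the GTID concern at 13:48: with the primary hung, its position can only be inferred from what the replica has applied. On MariaDB the positions the replica knows about can be read with standard queries like the one below (a sketch assuming root socket access via `sudo mysql`; these are not commands taken from the log):

    # On tools-db-5 (the replica): gtid_slave_pos is the last position applied
    # from the primary, gtid_current_pos additionally includes local writes. If
    # replication was in sync when the primary stopped answering, this is the
    # best available estimate of the primary's last GTID.
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid%pos';"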
[13:58:19] anyway, +1 on the current state
[13:58:49] of course creating this MR was enough for mariadb to complete the shutdown
[13:58:55] lol
[13:59:13] I'm abandoning the MR and going back to the original plan (tools-db-4 will REMAIN the primary)
[13:59:19] xd
[14:00:51] tools-db-4 is up and back to RW
[14:01:14] now, I have to restart the replica on -5
[14:03:33] replica up, and in sync
[14:07:01] this was more adventurous than I was hoping for :/ in retrospect, we could have declared an incident after maybe 20 mins of downtime
[14:07:10] I will send an email to cloud-announce
[14:08:08] ack thanks
[14:53:07] WUT? :D https://usercontent.irccloud-cdn.com/file/slvbm5iD/Screenshot%202025-04-28%20at%2016.52.13.png
[14:53:23] (this was at the end of the tools-db shutdown)
[14:54:06] actually immediately after mariadb restarted
[14:54:38] maybe just a prometheus glitch, but an interesting one
[15:02:49] uuuuhhh, interesting
[15:03:02] I did not see any process stuck in D continuously though
[15:17:12] no, I didn't either
[15:27:16] cherry-picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139455 to tools-puppetserver to test it
[15:30:44] ok, that works perfectly
[15:30:47] so reviews welcome :D
[15:30:53] and I need to do the matching thing in metricsinfra
[15:38:18] taavi: I don't have time to push the CNAME patch today (I'm trying to wrap up the other things related to the slow shutdown), but I created T392831 to do that later this week
[15:38:18] T392831: [toolsdb] Use DNS CNAMEs instead of A records - https://phabricator.wikimedia.org/T392831
[15:38:30] thanks!
[15:38:37] feel free to send a patch yourself if you have time, otherwise I'll do it on Wed (I'm out tomorrow)
[15:46:32] I'm going offline a little earlier, see you all on Wed!
[15:48:22] re: mariadb slow shutdown, I added some logs and details in T392596
[15:48:22] T392596: [toolsdb] Upgrade from 10.6.20 to 10.6.21 - https://phabricator.wikimedia.org/T392596
[15:48:42] and created a follow-up task T392828
[15:48:42] T392828: [toolsdb] MariaDB sometimes takes very long to shut down - https://phabricator.wikimedia.org/T392828
[15:49:41] * dhinus offline
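For completeness, the checks behind "tools-db-4 is up and back to RW" and "replica up, and in sync" typically boil down to standard MariaDB commands like the following (a sketch assuming root socket access via `sudo mysql`; the exact commands run here are not in the log):

    # On tools-db-4 (primary): confirm the server is writable again (expect 0);
    # as noted above, it comes up read-only after a restart until set to RW.
    sudo mysql -e "SELECT @@GLOBAL.read_only;"
    # sudo mysql -e "SET GLOBAL read_only = 0;"   # only if it is still read-only

    # On tools-db-5 (replica): confirm both replication threads are running
    # and the replica has caught up with the primary.
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:|Seconds_Behind_Master'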