[07:14:15] morning
[08:05:06] arturo: let's merge https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/218 now? unless you're working on a patch to make that a map
[08:11:34] ok
[08:12:41] idm is not letting me auth at the moment, but +1 via IRC taavi
[08:13:45] Apply complete! Resources: 32 added, 0 changed, 0 destroyed.
[08:13:47] thanks
[08:21:44] cool
[08:42:30] oh my, morning! I just finished sieving all my emails from my last absence...
[08:42:36] 🎉
[08:46:30] o/
[09:46:37] is someone touching /srv/tofu-infra in cloudcontrol1011?
[09:50:50] seemingly not, as no-one has logged onto that host since thursday
[09:51:02] so how does something keep changing on that repo in a way that breaks puppet??
[09:51:47] i'm going to fully re-create that git clone
[09:53:24] sgtm, I wonder if it's related to the old T373815
[09:53:25] T373815: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815
[09:54:08] or maybe the parent task T374022
[09:54:09] T374022: tofu-infra: the cookbook should use a different git tree copy than the main one - https://phabricator.wikimedia.org/T374022
[09:55:14] "in case the cookbook dies, we could leave the main git tree copy in inconsistent state"
[09:55:50] the cookbook failed a few times recently because of the codfw s3 issue
[09:56:32] can we fix the cookbook to not do that?
[09:57:09] iirc it was supposed to have been fixed
[09:58:01] T374022 was fixed, and the ticket is marked as resolved
[10:02:59] mmm
[10:03:10] but the fix was only for the case in which the cookbook is operating on a MR
[10:04:52] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/219
[10:35:04] there is a power outage in my block
[10:51:12] arturo: i'm guessing we don't have any easy way to rename those in the state instead of recreating?
[11:13:44] power outage update: this seems to be a massive country level power outage. I'm sending you this message using tethering from a neighbor's phone, my ISP doesn't work
[11:13:58] taavi: very tedious, I don't recommend
[11:14:19] the change seems massive but it will be fast to apply
[11:18:14] arturo: ok. I guess I should apply that given your power/network situation?
[11:18:37] yes, I cannot be at the laptop at the moment
[11:21:25] tofu seemingly has decided that when applying that it'll destroy all the things first and create only afterwards :/
[11:22:10] yesh
[11:22:14] expected
[11:23:18] now it's creating things again
[11:23:21] all done in eqiad1
[11:23:25] Apply complete! Resources: 136 added, 0 changed, 137 destroyed.
[11:23:30] one less weirdly enough
[11:24:13] yes, I deleted one that was not required
[13:01:47] I'm gonna restart tools-db-5 (replica) and tools-db-4 (primary) for T392596
[13:01:47] T392596: [toolsdb] Upgrade from 10.6.20 to 10.6.21 - https://phabricator.wikimedia.org/T392596
[13:04:58] ack
[13:05:55] tools-db-5 restarted, proceeding with tools-db-4
[13:08:23] as expected shutdown of primary is taking a while
[13:10:05] ack
[13:10:12] let us know if you want any help
[13:10:24] so far all going as planned
[13:10:38] I'm not sure exactly what takes so long, but it happened the last time I restarted mariadb as well
[13:10:45] the last log line is "InnoDB: FTS optimize thread exiting."
[13:13:48] dhinus: just acked the page
[13:13:59] was that expected?
[13:14:10] oops sorry, I should have silenced
[13:14:28] ack, np, it's the reboot then that brings it read-only right?
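A note on the tofu apply above: taavi's question at 10:51 is about renaming the resources in the tofu state instead of destroying and recreating them. That is possible with `tofu state mv`, but it has to be done one resource address at a time, which is why arturo calls it very tedious for a change of this size. A minimal sketch, with purely hypothetical resource addresses:

    # Hypothetical addresses, for illustration only: move a resource to a new
    # address in the state so the next plan does not destroy and recreate it.
    tofu state mv \
        'module.cloudvps.openstack_networking_secgroup_v2.default' \
        'module.cloudvps.openstack_networking_secgroup_v2.secgroups["default"]'

With the ~136 resources involved here this would have meant scripting over `tofu state list` and running one move per resource, so letting tofu destroy and recreate was the simpler option.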
[13:14:32] (not that it flipped after)
[13:14:34] I will add a note to https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Minor_version_upgrade_(e.g._10.6.19_to_10.6.20)
[13:15:05] yes, right now it's completely unreachable, when the shutdown completes, it will come up as read-only until manually set to RW
[13:15:05] there's also an alert with `ToolsDB replication is broken on tools-db-5 (errno 2003)`
[13:15:13] ack
[13:15:19] thanks, just double checking :)
[13:18:07] still shutting down, we're at 10 mins, the last time it completed in about 8 mins
[13:18:31] no logs since 13:07 UTC
[13:19:01] CPU usage is pretty low but not zero for process mysqld
[13:19:56] htop shows some threads are going in and out of D state
[13:22:04] hmm... I don't like D state :/
[13:22:20] what does iotop show?
[13:22:26] what files/devices are they getting stuck on?
[13:22:28] I took a screenshot of htop
[13:22:40] you can also ssh to tools-db-4 to debug in realtime
[13:24:41] I see peaks of ~30MB/s on /dev/sdb
[13:24:59] writes
[13:25:29] yes, so disk is the bottleneck maybe
[13:25:57] but it shouldn't write so much data before shutdown
[13:26:13] does it wait for the last query to finish? maybe it's executing a big one?
[13:27:46] I don't think so, because I cannot connect to the server, so all connections were terminated
[13:28:18] (or at least I believe so, usually if there's a query running I can still connect and run "SHOW PROCESSLIST")
[13:29:04] we are now at 20 minutes since the last log line
[13:30:05] I would like to avoid using "kill -9" but I think I'll have to do it at some point, maybe after 30 minutes from the shutdown command
[13:30:42] I was trying to connect yep, and was unable to :/
[13:38:32] innodb_fast_shutdown is set to 1, which is not the slowest setting https://mariadb.com/kb/en/innodb-system-variables/#innodb_fast_shutdown
[13:38:42] this will need more investigation
[13:42:14] we're now past 30 mins, mysqld is still doing something, with ~20MB/s disk writes
[13:43:44] I'm a bit reluctant to use "kill -9", I think instead I can promote the replica to primary
[13:44:07] The replica has already been restarted and is working fine
[13:44:35] The only downside is it requires a DNS update, and some clients have probably cached the IP, but it should not be too bad
[13:48:12] ha, the failover procedure requires knowing the GTID of the primary, and I can't check it. it should be in theory the same ID I can see in the replica, but that's an extra question mark.
[13:52:12] ok I think it should be fine, I'm gonna continue with the failover, following the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary
[13:56:19] dcaro: taavi: can I get a +1 for https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/220/diffs
[13:56:42] 👀
[13:56:56] dhinus: syntax error, you're missing an end quote for the replica address
[13:57:05] thanks!
[13:57:27] side note: why's that not a CNAME? just having the hostname there would be a bit nicer to handle
[13:58:01] hmm that's true, I'll fix that later!
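A note on the GTID concern at 13:48: with the primary hung, its position can only be inferred from what the replica has applied. On MariaDB the positions the replica knows about can be read with standard queries like the one below (a sketch assuming root socket access via `sudo mysql`; these are not commands taken from the log):

    # On tools-db-5 (the replica): gtid_slave_pos is the last position applied
    # from the primary, gtid_current_pos additionally includes local writes. If
    # replication was in sync when the primary stopped answering, this is the
    # best available estimate of the primary's last GTID.
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid%pos';"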
[13:58:19] anyway, +1 on the current state
[13:58:49] of course creating this MR was enough for mariadb to complete the shutdown
[13:58:55] lol
[13:59:13] I'm abandoning the MR and going back to the original plan (tools-db-4 will REMAIN the primary)
[13:59:19] xd
[14:00:51] tools-db-4 is up and back to RW
[14:01:14] now, I have to restart the replica on -5
[14:03:33] replica up, and in sync
[14:07:01] this was more adventurous than I was hoping for :/ in retrospect, we could have declared an incident after maybe 20 mins of downtime
[14:07:10] I will send an email to cloud-announce
[14:08:08] ack thanks
[14:53:07] WUT? :D https://usercontent.irccloud-cdn.com/file/slvbm5iD/Screenshot%202025-04-28%20at%2016.52.13.png
[14:53:23] (this was at the end of the tools-db shutdown)
[14:54:06] actually immediately after mariadb restarted
[14:54:38] maybe just a prometheus glitch, but an interesting one
[15:02:49] uuuuhhh, interesting
[15:03:02] I did not see any process stuck in D continuously though
[15:17:12] no, I didn't either
[15:27:16] cherry-picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139455 to tools-puppetserver to test it
[15:30:44] ok, that works perfectly
[15:30:47] so reviews welcome :D
[15:30:53] and I need to do the matching thing in metricsinfra
[15:38:18] taavi: I don't have time to push the CNAME patch today (I'm trying to wrap up the other things related to the slow shutdown), but I created T392831 to do that later this week
[15:38:18] T392831: [toolsdb] Use DNS CNAMEs instead of A records - https://phabricator.wikimedia.org/T392831
[15:38:30] thanks!
[15:38:37] feel free to send a patch yourself if you have time, otherwise I'll do it on Wed (I'm out tomorrow)
[15:46:32] I'm going offline a little earlier, see you all on Wed!
[15:48:22] re: mariadb slow shutdown, I added some logs and details in T392596
[15:48:22] T392596: [toolsdb] Upgrade from 10.6.20 to 10.6.21 - https://phabricator.wikimedia.org/T392596
[15:48:42] and created a follow-up task T392828
[15:48:42] T392828: [toolsdb] MariaDB sometimes takes very long to shut down - https://phabricator.wikimedia.org/T392828
[15:49:41] * dhinus offline
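For completeness, the checks behind "tools-db-4 is up and back to RW" and "replica up, and in sync" typically boil down to standard MariaDB commands like the following (a sketch assuming root socket access via `sudo mysql`; the exact commands run here are not in the log):

    # On tools-db-4 (primary): confirm the server is writable again (expect 0);
    # as noted above, it comes up read-only after a restart until set to RW.
    sudo mysql -e "SELECT @@GLOBAL.read_only;"
    # sudo mysql -e "SET GLOBAL read_only = 0;"   # only if it is still read-only

    # On tools-db-5 (replica): confirm both replication threads are running
    # and the replica has caught up with the primary.
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:|Seconds_Behind_Master'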