[01:47:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:57:48] FIRING: [2x] PuppetFailure: Puppet has failed on thanos-be2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:54:55] I've switched es6 codfw master
[05:40:02] switched es6 primary master
[05:57:48] FIRING: [2x] PuppetFailure: Puppet has failed on thanos-be2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:58:08] es6 is now running 10.11
[06:04:45] marostegui, Amir1: can I start reboots on s8 in codfw?
[06:05:28] federico3: fine by me
[06:05:36] thanks
[07:10:14] those puppet failures are nodes which haven't installed properly, I'll have a look in a bit
[09:15:14] Emperor: o/ when you have a moment, lemme know if https://phabricator.wikimedia.org/T395659 is doable
[09:15:23] (yes I know, another one! Please be patient :D)
[09:17:14] Oh, yes, sorry, I meant to reply to that.
[09:17:47] thanksssss
[09:23:52] (though the answer, I'm afraid, is that I don't think it makes sense to move it off thanos-swift right now)
[09:24:00] I've replied on-ticket, anyhow
[09:35:58] Fine by me too
[09:40:53] there are uncommitted changes on dbctl FYI ("es2040": 6)
[10:12:54] do we know where they came from?
[10:18:04] <_joe_> I would assume it's related to the work on es6?
[10:19:02] yeah, that's probably marostegui
[10:21:44] That's me
[10:21:50] I'm coming back from the doctor
[10:21:55] Should be home in 10 mins
[10:22:15] It's the repooling bug
[10:22:22] With the cookbook
[10:25:30] the host looks healthy, I committed the change
[10:32:23] so, when I mention those alerts, I don't expect people to drop things and attend to them. I thought it could be useful in case someone is doing maintenance, etc.
[10:40:06] It was the repooling cookbook bug
[10:40:32] :/
[10:43:26] I wonder now what actually happens with the cookbook
[10:43:31] ==> Review the changes. Do you still want to commit them?
[10:43:31] Type "go" to proceed or "abort" to interrupt the execution
[10:43:32] > go
[10:43:32] User input is: "go"
[10:43:32] Nothing to commit
[10:43:40] I wonder if it now keeps waiting, gets stuck, or what
[10:53:05] Amir1: the rolling restart script never blocks on repooling, can we use the same code in the cookbook?
[10:54:34] that was my comment
[10:55:37] you just call host.repool or host_section.repool and it just works, so that could be an option
[10:57:40] marostegui: hello! I think clouddb1016:x3 is still missing the wikidatawiki_p database
[10:58:00] taavi: Doing it now
[10:58:26] taavi: done
[11:01:53] Amir1, marostegui: if that works for you we can use the "repool" function from host.py in the cookbook, either by deploying auto_schema where needed or by copy-pasting just the repool function
[11:02:35] for now, we need to discuss how the diffing works and whether it's needed
[11:02:55] once that's resolved, the end result will be logically the same
[11:07:13] Amir1: in my understanding your code shows the diff on stdout during rolling restarts but doesn't error out. That's good enough for me AFAICT
[11:53:44] is it ok for me to retry es7 backups?
[11:53:52] jynus: go for it
[11:53:57] @ es2040 ?
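For context on the repool discussion above (10:53-11:07): a minimal sketch of what a gradual, non-blocking repool helper along the lines of auto_schema's host.py could look like if copied into a cookbook. The step percentages, wait time, and exact dbctl invocations are assumptions for illustration, not the actual auto_schema code.

    import subprocess
    import time

    # Illustrative sketch only; the real auto_schema repool() may differ.
    REPOOL_STEPS = (10, 25, 50, 75, 100)  # percentage of normal weight (assumed)
    WAIT_BETWEEN_STEPS = 300              # seconds to let traffic settle (assumed)

    def repool(instance: str) -> None:
        """Gradually repool a database instance, committing at each step."""
        for pct in REPOOL_STEPS:
            # dbctl CLI calls as commonly documented; adjust flags as needed.
            subprocess.run(["dbctl", "instance", instance, "pool", "-p", str(pct)],
                           check=True)
            subprocess.run(["dbctl", "config", "commit", "-m",
                            f"Repool {instance} to {pct}%"], check=True)
            if pct != REPOOL_STEPS[-1]:
                time.sleep(WAIT_BETWEEN_STEPS)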
[11:53:59] jynus: I am doing a switchover and a reboot on codfw
[11:54:06] jynus: es2040 should be good
[11:54:14] let me know when finished, I can wait
[11:54:24] ok, should take only a few mins
[11:54:26] also es7 is RO now
[11:54:39] no worries, I prefer to wait until things are stable
[11:57:00] and of course the master got unstable
[11:57:05] it is unbelievable
[11:57:09] I am going to open a bug report
[11:57:13] :-(
[11:58:02] I am going to attach gdb
[12:05:05] all fine and I am now enabling writes again
[12:05:09] Will open a mariadb bug in a bit
[12:18:05] if the (de)pool cookbook is causing so much annoyance, what's preventing you from implementing any of the fixes proposed in T383760 (including removing the check completely, if that's what your team wants)?
[12:18:05] T383760: dbctl: expose diff via API in a more structured way - https://phabricator.wikimedia.org/T383760
[12:23:32] kwakuofori: ^
[12:23:38] federico3: ^
[12:25:50] if we can find consensus on what the team wants I'm happy to implement it. marostegui: you wrote "fine by me" regarding disabling the check?
[12:26:17] federico3: My last comment is still there: https://phabricator.wikimedia.org/T383760#10811594
[12:26:29] I had a chat about this with elukey. I argued that even if we want to implement it, let's do it separately from the introduction of the cookbook, since this adds more moving parts; once the repool/depool cookbook is stable and polished, we can add improvements. I don't know if I managed to convince him, but it's worth putting it out here
[12:27:47] I've opened the semi-sync bug https://jira.mariadb.org/browse/MDEV-36934 - I am not sure if it captures everything, but it can be a starting point
[12:34:48] marostegui: I'm aware. If we can find consensus on a desired solution I can implement it. If we are ok with removing the check, that should be pretty quick to do
[12:35:54] federico3: I am fine with either solution Scott suggested, if that addresses the bug
[12:36:56] the thing is that removing the check is the third way
[12:37:45] there is already a check to avoid making changes if the diff is not empty
[12:37:49] Please decide on one of them among yourselves and go for it. This bug is biting us pretty much every day; let's get it solved, especially if 2 solutions are already there waiting and one of them seems to be easier
[12:37:50] that's not being touched
[12:44:51] kwakuofori, federico3, Amir1: please decide which solution needs to be implemented; 2 of them are already on the task, done by Scott, and one (disabling the check) also seems to be there and easy.
[12:46:25] I vote for deleting it for now to unblock the issue
[12:46:39] then we can calmly check what options we have
[12:47:15] Works for me
[12:47:50] So let's update the task and say that we are going for that
[12:48:02] ok
[12:48:31] Thank you all
[12:50:27] I'm also updating the related tasks
[12:50:41] excellent, thank you
[13:10:18] marostegui: do you need clouddb1016:x3 to be down while you set up 1020? we are currently routing x3 to 1016, I think
[13:10:37] dhinus: yes, it needs to be down
[13:10:54] ok. taavi: should we route x3 back to s8?
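For reference, the check debated above (T383760) is essentially a guard against committing on top of unrelated, uncommitted dbctl changes such as the leftover "es2040": 6. Below is a minimal sketch of that kind of guard; it assumes that "dbctl config diff" prints nothing when the staged config matches the committed one, and it is not the actual cookbook code.

    import subprocess

    # Hypothetical guard, not the real cookbook implementation.
    def has_uncommitted_changes() -> bool:
        """Return True if dbctl reports pending, uncommitted config changes."""
        result = subprocess.run(["dbctl", "config", "diff"],
                                capture_output=True, text=True, check=False)
        return bool(result.stdout.strip())

    def commit_if_clean(message: str) -> None:
        """Abort instead of silently committing someone else's staged changes."""
        if has_uncommitted_changes():
            raise RuntimeError("dbctl has uncommitted changes; resolve them first")
        subprocess.run(["dbctl", "config", "commit", "-m", message], check=True)

Removing the check, as decided above, amounts to dropping the has_uncommitted_changes() gate and always committing.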
[13:11:02] There shouldn't be traffic arriving at 1016, we are still setting everything up
[13:11:33] ack, I'll revert that patch then
[13:11:35] I think we wrongly assumed 1016 was completed, sorry :)
[13:11:46] taavi: +1
[13:11:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153144
[13:12:09] I think the idea was to test that all was good on 1016 and then clone the other host
[13:12:45] marostegui: makes sense, and no big deal, just a small misunderstanding!
[13:15:08] revert is done
[13:15:08] Amir1, volans - o/ sorry, I was in a meeting - I think we could do as Amir suggested: 1) we decide what road we want to take for the diff, 2) at some point in the future we introduce it to the cookbooks. In the meantime, we do have the functionality coded/ready/tested/etc.
[13:15:14] how does that sound?
[13:15:47] SGTM
[13:23:34] all right, so I noticed that federico3 reported the check_diff decision on the task; let's find some time to decide what the structured diff should look like
[13:43:58] more precisely - could somebody from DP comment on the task by end of week with the decision?
[13:44:53] federico3 kwakuofori ^
[14:05:57] I'm discussing it with Amir1
[14:43:46] elukey: he is going to get back to you about this
[14:50:42] ack thanks!
[15:19:29] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138831 please? I've checked that node has 0 weight (see a comment on the CR), and it now needs removing so it can be reimaged on the new-style VLAN and then restored.
[15:24:46] Thanks :)
[15:32:20] taavi: you can re-enable clouddb1016
[15:32:31] I think I will have clouddb1020 ready by tomorrow
[16:29:22] marostegui: in updating T395241 I also ticked some es* hosts, and es2048 now showed up
[16:31:00] Yes, that's a new host
[16:33:59] is es7 available now for backup?
[16:34:03] marostegui: ^
[16:34:45] Yes!
[16:34:49] It should all be fine
[16:35:06] Sorry I forgot to ping you, too many things on my plate
[16:37:45] no prob, I guessed
[16:37:49] but wanted to ask first
[16:38:57] marostegui: thanks, doing
[16:40:03] backup now retrying, should take around 5 hours
[16:43:27] marostegui: I think these are missing some grants, I can log in with my user but can't see the _p database
[16:43:31] it's also missing heartbeat_p
[16:44:30] taavi: clouddb1016?
[16:44:34] yes
[16:48:36] any tips on how to terminate some queries that have been stuck in "Killed" for 8 days? (on clouddb1019@s4) :) https://phabricator.wikimedia.org/T390767#10879848
[16:52:25] I'll check tomorrow, taavi
[16:53:32] dhinus: you cannot re-kill a killed query
[16:54:08] Just wait (or, if it's very urgent, restart mariadb)
[16:54:14] I think they are still causing locks, but I'm not sure how to get rid of them. do you expect "systemctl stop mariadb" will help, or will it also get stuck?
[16:54:46] I can wait, but I see from wmf-pt-kill that the "kill" was issued on 2025-05-26, so I don't have much hope they will eventually complete
[16:55:27] dhinus: I'd try a stop before any sort of kill
[16:55:37] ok let's try, it's depooled anyway
[16:55:40] thanks!
[16:59:00] "[Warning] /opt/wmf-mariadb106/bin/mysqld: Thread 33282161 (user : 's52788') did not exit"
[16:59:14] this was logged in the mariadb journal for the 4 stuck threads
[17:00:16] the shutdown is not completing, I'll wait a bit
[17:08:28] still stuck :(
[17:10:00] the mysql process is VERY active (366% CPU). I'll set a longer downtime and check back in a few hours
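As a side note to the stuck-query debugging above, here is a small sketch (connection parameters and the one-hour threshold are assumptions) for listing sessions that have sat in the Killed state for a long time on a replica such as clouddb1019. MariaDB normally shows such threads with Command = 'Killed' in the processlist until they actually exit.

    import os
    import pymysql

    STUCK_THRESHOLD = 3600  # seconds; assumed cut-off for "stuck"

    def find_stuck_killed_threads(host: str, port: int = 3306):
        """Return killed-but-not-exited threads, oldest first."""
        conn = pymysql.connect(host=host, port=port,
                               read_default_file=os.path.expanduser("~/.my.cnf"))
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, user, db, time, state, LEFT(info, 80) "
                    "FROM information_schema.processlist "
                    "WHERE command = 'Killed' AND time > %s "
                    "ORDER BY time DESC",
                    (STUCK_THRESHOLD,),
                )
                return cur.fetchall()
        finally:
            conn.close()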
[17:14:51] dhinus: kill it; if it didn't stop in 20 minutes, it will never stop
[17:15:29] jynus: any idea what it is doing/trying to do? :)
[17:16:28] would you just "kill -9" the mysql process?
[17:20:07] if it doesn't stop after 20 minutes, yes
[17:24:08] ok, fasten your seatbelts :P
[17:26:17] 1889 ? Zsl 259817:35 [mysqld]
[17:26:38] the process is STILL there () after a kill -9
[17:26:58] dhinus: it will clean itself up and restart
[17:27:03] ok!
[17:27:09] Just grep the log and you'll see it coming back
[17:28:57] it did disappear after about a minute and "systemctl stop" completed
[17:29:08] now trying systemctl start
[17:29:33] ] InnoDB: Starting crash recovery
[17:30:10] up & running
[17:30:38] starting the slave
[17:31:14] it's catching up
[17:35:48] back in sync and repooled!
[17:35:51] thanks for the support :)
[22:19:36] Scariest thing I've run in my career, I lost track of the number of times I checked the host and the command
[22:19:39] > sudo db-mysql db1255 -e "use wikidatawiki; show tables" | grep -v wbt | xargs -I{} bash -c "sudo db-mysql db1255 -e 'use wikidatawiki; drop table if exists {};'"
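A hedged sketch of a less nerve-wracking, two-step variant of that last one-liner: generate the DROP statements first, review them, and only then feed them back to db-mysql by hand. db-mysql is the wrapper used above; the header-row handling and the substring filter (mirroring grep -v wbt) are assumptions, not a statement about how the original was run.

    import subprocess

    HOST = "db1255"
    DATABASE = "wikidatawiki"

    def build_drop_statements() -> list[str]:
        """Print DROP statements for review instead of piping them straight to xargs."""
        out = subprocess.run(
            ["sudo", "db-mysql", HOST, "-e", f"use {DATABASE}; show tables"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        # Assumes the first line is the "Tables_in_..." header row.
        tables = [t for t in out[1:] if t and "wbt" not in t]
        return [f"DROP TABLE IF EXISTS `{t}`;" for t in tables]

    if __name__ == "__main__":
        print("\n".join(build_drop_statements()))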