[08:58:36] One thing I would like to note is that dborch hosts were put on the other list here: https://phabricator.wikimedia.org/T426560
[08:59:43] jynus: thanks, I will move them
[09:01:48] cezmunsta: for the record the es* reboot script is still running (you can see the progress on https://phabricator.wikimedia.org/T426633)
[09:02:22] ack
[09:04:15] and the slope at https://grafana.wikimedia.org/d/fcnrmzq/mariadb-kernel-versions?orgId=1&from=now-7d&to=now&timezone=utc
[09:36:26] I'm rebooting dborch1002 to update the kernel if nobody is using orchestrator right now @marostegui @Amir1 @cezmunsta
[10:01:13] federico3: thanks we are in a meeting
[10:01:37] can I go ahead?
[10:03:09] I guess yes
[10:35:53] ok it rebooted, took just a few seconds
[11:59:44] I have hit an issue running `sudo cookbook sre.mysql.major-upgrade -t T426725 --reimage trixie --repool db1258 wmf-mariadb1011`
[11:59:45] T426725: Migrate x3 section to Debian Trixie - https://phabricator.wikimedia.org/T426725
[12:00:40] Something went wrong when entering the password, so it failed. I attempted "retry", but it didn't ask for the password again and so just failed
[12:02:30] marostegui: here is that scenario that I asked about :| :D
[12:07:13] cezmunsta: you've got the output?
[12:07:25] Yes
[12:16:04] Added you as a subscriber to the paste :)
[12:16:12] https://phabricator.wikimedia.org/P92681
[12:17:29] So the idrac acting up?
[12:18:06] let me do a cold restart
[12:18:11] ack
[12:18:18] I can log in there fine, but let's see
[12:18:45] elukey: anything rings a bell at https://phabricator.wikimedia.org/P92681 ?
[12:18:54] I am going to do a cold restart which doesn't hurt anyway, but just asking
[12:18:56] It was either that I caught the enter key, or my paste buffer didn't copy correctly.
[12:19:42] I presume that the VLAN prompt then jumped in and so retry != what it needed to retry
[12:21:10] The problem is that re-running the cookbook also then fails because it can't connect to the DB
[12:22:14] cezmunsta: there are two options, either you start mariadb manually and re-run it again or you run the reimage cookbook on its own, but you'd have to do the rest of the steps manually. So let me restart the idrac and we can try again "our" cookbook
[12:22:36] kk
[12:27:54] cezmunsta: try again
[12:28:38] dhinus: was clouddb1015 reimaged? meaning can I start an instance to see if pt-kill starts?
[12:29:19] I'm deploying this spicy change, it's noop but if it breaks, it's going to block traffic to port 3306 (and similar) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289382
[12:29:19] marostegui: thanks, will do
[12:29:30] (obviously, first disabling puppet, etc.)
[12:29:31] marostegui: weird, have you tried to connect to the BMC via root@mgmt?
[12:29:41] elukey: yep, it worked all fine
[12:29:47] I reset the idrac in case it was just being funny
[12:29:53] elukey: cezmunsta is going to try again
[12:30:02] marostegui: reimage is done, instances are already running and replicating.
[12:30:14] ah ok
[12:30:22] only wmf-pt-kill was missing
[12:30:46] I realized only after starting the instances
[12:31:07] dhinus: yeah only, this is me right now with that "only": https://makeagif.com/gif/hal-fixing-a-light-bulb-from-malcolm-in-the-middle-s03e06-health-scare-bvGCXi
[12:31:37] LOL
[12:31:51] https://usercontent.irccloud-cdn.com/file/cQjZjuWf/image.png
[12:32:31] it's depooled, and I'm not touching it so it's all yours if you need to stop the db etc
[12:33:27] thanks - will do
[12:34:51] marostegui: I had to start the replica else it errored in `sre.mysql.major-upgrade` with a `ConnectionRefusedError`
[12:35:01] Now seems to be proceeding
[12:35:04] cezmunsta: that's fine yep
[12:35:13] cezmunsta: so the idrac doing idrac things
[12:36:35] technically the reimage cookbook is still running, and polling for a successful puppet run, but it completed all the other steps. do you want me to ctrl+c the cookbook run?
[12:36:44] (that's for clouddb1015)
[12:37:02] marostegui: yep, no paste issue this time :)
[12:37:05] ty
[12:39:08] nice
[12:48:12] dhinus: I think I got it working, but puppet fails with E: Unable to locate package wikireplicas-utils
[12:48:15] No clue what that is
[12:48:34] marostegui: ah I built that! it's probably also missing from trixie
[12:48:40] I can fix it
[12:48:57] (that's the new home for all maintain-* scripts that used to be in wmf/puppet)
[12:49:05] https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils
[12:49:07] ok cool
[12:49:16] pt kill is now running
[12:49:20] puppet brought it up nicely
[12:49:22] nice, thank you!
[12:49:25] I will push it to the repo
[12:54:09] wikireplicas-utils installed, puppet completed successfully!
[12:54:39] nice work, marostegui
[12:55:08] okay if I disable puppet on db-all? any objections?
[12:55:28] I was about to do a reimage on a misc host, should I wait? Amir1
[12:56:02] jynus: it's to deploy this change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289382 so it should be noop
[12:56:18] just being extra cautious
[12:56:35] I leave the decision to wait or do it at the same time to you
[12:56:37] I see, so assuming it takes little time I will wait
[12:56:47] sounds good
[12:57:03] not that I am worried, just to make sure it doesn't interact with the reimage
[12:58:00] I also have done the full migration already for some of my hosts, it went well, only required a restart
[12:58:36] for dbs you may want to do it in steps I guess
[12:59:31] yeah, once we get to actually switching to nftables, that'll be much more gradual
[12:59:45] Amir1: check with cezmunsta - he's reimaging a host now
[12:59:59] yeah, db1258
[13:00:24] one thing you should be aware of, Amir1, for that next step (not for now): transferpy will not work well between the puppet change and the restart
[13:00:45] as it will think the host is using netfilter but not until restart
[13:00:55] ah, thanks
[13:01:36] it works for both fws, but not during the transition period
[13:12:13] the roll out of the puppet patch is done
[13:12:44] puppet enabled?
[13:13:23] yup
[13:13:33] > sudo cumin 'A:db-all' 'run-puppet-agent --enable "merging gerrit:1289382"'
[13:13:37] thanks!
[13:14:24] all good
[13:20:47] federico: have you set a configuration for ruff?
[13:21:17] This is for dbproxies https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289369 if anyone feels like reviewing them
[13:21:40] Amir1: you could start with dbproxy2*
[13:22:12] Sounds good. Thanks
[13:24:33] I am reimaging db2184, hopefully it doesn't p*ge again, like the equivalent host yesterday on eqiad
[14:47:36] federico3: is there supposed to be content when selecting the hyperlink for a host name via https://zarcillo.wikimedia.org/ui/instances ? e.g. https://zarcillo.wikimedia.org/ui/hosts#es1050 for es1050
[14:49:07] nvm it had logged me out :)
[14:50:56] UX suggestion would be to not link to hosts when not logged in, or put a message on hosts
[15:32:37] interesting tip: when migrating vlan, heartbeat @ orchestrator breaks because of the ip change (which changes the slave id)
[15:32:59] so remember to clear heartbeat table after vlan migration / ip change
[15:36:22] migrating the puppet module on dbproxies now
[15:36:27] (puppet disabled)
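
For the 15:32 tip about clearing the heartbeat table after an IP change, a rough sketch of what the cleanup statement might look like. This is not the team's actual procedure: the table name `heartbeat.heartbeat`, the `server_id` column (as in pt-heartbeat's default schema), the helper name and the example ids are all assumptions, and the code only builds the statement without connecting to any database.

```python
from typing import Tuple

# Assumed table name; adjust to wherever the heartbeat rows actually live.
HEARTBEAT_TABLE = "heartbeat.heartbeat"


def stale_heartbeat_delete(old_server_id: int, new_server_id: int) -> Tuple[str, Tuple[int]]:
    """Build a parameterized DELETE for rows left behind by the old server id.

    Hypothetical helper: assumes heartbeat rows are keyed by a `server_id`
    column and that the id changed together with the IP.
    """
    if old_server_id == new_server_id:
        raise ValueError("server id did not change; nothing to clear")
    sql = f"DELETE FROM {HEARTBEAT_TABLE} WHERE server_id = %s"
    return sql, (old_server_id,)


if __name__ == "__main__":
    # Made-up ids, for illustration only.
    statement, params = stale_heartbeat_delete(171966479, 171970575)
    print(statement, params)
```

Returning the statement and its parameters separately keeps the id out of the SQL string, so it can be handed to whatever client is in use and reviewed before anything is executed.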