[07:06:04] Hey folks o/
[07:06:24] I'd need to upgrade spicerack on cumin*, but this time it would be better not to have cookbooks running
[07:06:49] there is a change to upgrade datetime to timezone-aware objects and it may affect serialization on etcd
[07:07:03] afaics there are cookbooks only from marostegui (hello hello)
[07:07:18] elukey: yeah, I was going to say, I have one running that will finish in around 1h
[07:07:19] is that ok?
[07:07:25] yep sure!
[07:07:27] I'd rather not kill it (it is a reimage + repool)
[07:07:49] marostegui: there is one for es2045 that started on Jun 11, still valid?
[07:08:07] elukey: nope, that can be killed, let me do it
[07:08:46] elukey: that one is gone, I have another one for es2047
[07:08:53] that is the one that will finish in around 1h
[07:08:59] perfect <3
[07:09:35] I am going to upgrade cumin2002 and do some tests in the meantime
[07:11:45] Thanks - I will ping you once done
[07:18:20] cezmunsta: o/ your spicerack change is live on cumin2002 if you want to test it
[07:19:40] elukey: o/ ty :)
[07:32:34] elukey: working as expected
[07:34:03] super
[07:37:10] I am proceeding with the spicerack keys cleanup on conf1007, tracked in https://phabricator.wikimedia.org/T429125
[07:40:20] side note for the Data Persistence folks - I see locks in etcd like /spicerack/locks/cookbooks/sre.mysql.major-upgrade and /spicerack/locks/cookbooks/sre.mysql.restart_sanitarium, vs host-specific ones like /spicerack/locks/cookbooks/sre.mysql.pool:es2047. It may make sense to add the lock_args method override in the first two to make them host-specific
[07:40:43] federico3: ^
[07:40:58] cezmunsta: ^
[07:44:45] elukey: that may explain an odd issue that I observed with a lock having disappeared by the time a cookbook went to remove it
[07:46:44] cezmunsta: the same cookbook or another one?
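A minimal sketch of the key naming discussed above, assuming only what the quoted etcd keys show: a shared prefix, the cookbook name, and an optional `:suffix`. The `lock_key` helper is hypothetical and is not part of spicerack; it only illustrates why two runs of the same cookbook end up on one key unless a host-specific suffix is added.

```python
LOCK_PREFIX = "/spicerack/locks/cookbooks"  # prefix as seen in the etcd keys quoted above


def lock_key(cookbook: str, suffix: str = "") -> str:
    """Hypothetical helper: build a lock key, appending ':suffix' when one is given."""
    key = f"{LOCK_PREFIX}/{cookbook}"
    return f"{key}:{suffix}" if suffix else key


hosts = ["es2045", "es2047"]

# Without a suffix, every run of the cookbook maps to the same key.
shared_keys = {lock_key("sre.mysql.major-upgrade") for _ in hosts}

# With the hostname as suffix, each run gets its own key.
per_host_keys = {lock_key("sre.mysql.major-upgrade", host) for host in hosts}

print(len(shared_keys))       # 1
print(sorted(per_host_keys))  # two distinct keys, one per host
```

In a real cookbook the suffix would come from the `lock_args` override mentioned in the log; the exact signature should be checked against the spicerack documentation.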
[07:47:03] 2 instances of the same cookbook
[07:48:13] I have only noticed it the once though, but it was the major-upgrade cookbook doing a reimage
[07:56:06] okok then it may be a problem if two run at the same time
[08:01:04] So, it looks like setting etcd_config when getting the Spicerack instance is the way to do that?
[08:08:48] elukey: my cookbook has finished
[08:10:24] super thanks
[08:10:28] elukey: ... which seems to be controlled via lock_args on the cookbook class?
[08:10:36] cezmunsta: correct
[08:10:57] you can find examples in various cookbooks
[08:11:46] Cheers, I will create the ticket to review and ensure that all locks in sre.mysql are explicitly set
[08:13:53] elukey: let me know when we can resume, no rush
[08:14:58] I am trying to find a way to delete v2 keys in etcd, I get insufficient creds, once I've done it I'll upgrade cumin1003
[08:42:38] ok found it, it seems it needs curl -X DELETE to port 2379 and not etcdctl
[08:47:22] the lock for sre.hosts.mysq.restart_sanitarium has timestamp "created": "2024-12-03 13:20:21.607593"
[08:47:23] lol
[08:47:28] I am going to remove it
[08:48:45] all cruft removed, final checks
[08:50:40] marostegui: you are free to go
[08:52:53] elukey: thank you!
[08:56:46] elukey: any chance the upgrade could have broken something?
[08:56:56] I just got a super fastr error on a reimage
[08:57:01] fast
[08:57:44] lemme check
[08:57:53] elukey: let me share the error
[08:58:20] https://phabricator.wikimedia.org/P94119
[08:58:39] elukey: I didn't even get the "waiting for uptime etc"
[08:58:46] It failed right away
[09:00:03] snap yes it is related to the datetime change
[09:00:20] elukey: Want me to create a task?
[09:00:50] nono I need to rollback
[09:01:00] elukey: No problem, thanks. Let me know when I can try again
[09:01:59] in /var/cache/apt/archives the only version missing is the one I need to rollback to
[09:02:06] what a lovely start of the week
[09:03:33] better now than on Friday :)
[09:03:47] ahh no maybe it is just a fix for the reimage cookbook
[09:04:42] hey, who here already closed a wiki and can double check I didn't miss anything in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1301341 ?
[09:05:26] marostegui: need 10 mins
[09:07:38] elukey: no rush
[09:11:03] marostegui: do you mind to test-cookbook https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1302088 ?
[09:11:57] elukey: yep, doing it now
[09:12:23] elukey: ah wait, we call that one from a different cookbook, that does all the mariadb stuff
[09:12:26] but give me a sec
[09:13:31] okok I am upgrading the other cookbooks as well
[09:13:42] elukey: running it now
[09:24:29] marostegui: all good?
[09:24:40] elukey: yes, so far it is now waiting for the reboot
[09:24:49] with the usual message
[09:26:25] godog: you have pending puppet changes. are they gtg?
[09:26:40] brouberol: woops! yes please and thank you
[09:26:46] np!
[09:27:36] I have updated the code review to reflect the same change to all cookbooks: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1302088
[09:27:46] if anybody has time for a review I'd be grateful <3
[09:27:59] marostegui: lemme know when it crosses this step, so we'll know that the new code is good
[09:28:25] elukey: So, it is doing it right now
[09:28:37] elukey: Generating a new Puppet certificate on 1 hosts: es2037.codfw.wmnet
[09:28:45] goooood
[09:28:53] yesss
[09:29:05] good work!
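For readers following the datetime discussion: a minimal sketch, using only the Python standard library, of how naive and timezone-aware datetimes differ once serialized and why mixing them fails. This is not spicerack code, and the actual traceback in P94119 is not reproduced in this log, so it only illustrates the general class of problem. The timestamp is the one quoted earlier for the stale lock.

```python
from datetime import datetime, timezone

# Naive value, as it would have been stored before the change.
naive = datetime.fromisoformat("2024-12-03 13:20:21.607593")

# The same instant as a timezone-aware object (assuming the stored value was UTC).
aware = naive.replace(tzinfo=timezone.utc)

print(str(naive))  # 2024-12-03 13:20:21.607593
print(str(aware))  # 2024-12-03 13:20:21.607593+00:00

# Arithmetic between the two kinds is rejected by Python.
mixed_error = None
try:
    aware - naive
except TypeError as exc:
    mixed_error = exc

print(type(mixed_error).__name__)  # TypeError
```

Code that reads old values written as naive timestamps and combines them with new aware ones can hit exactly this kind of `TypeError`, which is one reason to clean up old keys before such an upgrade.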
[09:35:45] hopefully it also completes fine
[09:36:01] if so I'll probably self-merge the above change, it is easy enough and people will now be unblocked
[09:36:04] elukey: yep, doing the first puppet run at the moment
[09:57:04] marostegui: some alerts are still not shown as recovered, not sure why
[09:57:11] icinga looks clean for es2037
[09:57:30] elukey: It is all fine, some alerts aren't recovered because mariadb isn't started, but the reimage is fine and all went ok
[09:57:34] elukey: thanks for the quick fix
[09:58:41] thanks for the test! I merged my change so people are unblocked
[09:58:48] ping me if you see any issue later on
[09:59:15] will do thanks
[11:00:27] handover: nothing to report, may it continue
[12:42:35] elukey: is this related to the discussion you had with cezmunsta earlier today?: Lock for key /spicerack/locks/cookbooks/sre.mysql.major-upgrade and ID 13aa99d6-637c-415f-9bed-9afd162fda5e not found. Unable to release it. Was expired?
[12:44:53] marostegui: that matches the issue that I saw in the recent past
[12:45:17] marostegui, cezmunsta: see https://phabricator.wikimedia.org/T429125#12017268
[12:50:27] I haven't removed the mysql.major-upgrade because it got removed correctly by marostegui's cookbook run
[12:50:43] marostegui: where did you get it? Blocking error or just a passing-by comment?
[12:51:03] elukey: At the end of the cookbook
[12:51:09] Updated Phabricator task T429118
[12:51:10] Lock for key /spicerack/locks/cookbooks/sre.mysql.major-upgrade and ID 13aa99d6-637c-415f-9bed-9afd162fda5e not found. Unable to release it. Was expired?
[12:51:10] END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[12:51:10] T429118: Migrate es6 section to Debian Trixie - https://phabricator.wikimedia.org/T429118
[12:52:06] lemme check on etcd
[12:54:14] marostegui: did you run that cookbook? Because on etcd I see a lock for sre.mysql.major-upgrade owned by cezmunsta with id d17d577a-f199-4346-a0d9-7d77928ec93b
[12:54:30] yeah, I did for es2036
[12:54:39] END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2036: Migration of es2036.codfw.wmnet completed
[12:54:41] so if you are using the same cookbook in parallel for various hosts yes that is expected
[12:54:55] elukey: yeah, I think cezmunsta and myself are using the same cookbook at the moment
[12:55:16] okok so let's override the lock_args method then, adding the hostname as suffix
[12:55:22] correct, mine are for s1
[12:55:22] it should be sufficient to avoid this race condition
[12:55:49] https://phabricator.wikimedia.org/T429132
[12:57:33] elukey: if we wanted to permit 2 tickets per section then would locking by section instead of hostname work?
[12:58:30] https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking should have a fairly comprehensive explanation of how it works
[12:58:42] cezmunsta: my understanding is that you can add a suffix with lock_args, so your lock will be named /spicerack/locks/cookbooks/sre.mysql.major-upgrade:suffic
[12:58:45] *suffix
[12:59:01] the more unique the suffix is, the better to avoid race conditions
[12:59:56] ack
[13:05:32] cezmunsta: FYI that requires using the class API, because the module API (legacy) is the bare minimum for very simple cookbooks. Both lock_args and runtime_description require it. (the latter to avoid https://sal.toolforge.org/production?p=0&q=%22sre.mysql.major-upgrade%22&d= )
[13:06:45] volans: ack and that is OK by me :)
[13:17:55] :)
[13:23:47] <_joe_> volans, elukey I'm about to deploy a pretty substantial HP upgrade. After this is deployed, I have several puppet changes to merge as well, which might cause some temporary issues to HP
[13:24:38] _joe_: ack, thanks for the headsup, we'll redirect the requests to you for manual filtering in the meanwhile :-P
[13:24:54] <_joe_> you can just write to etcd by hand
[13:26:26] it's only Monday _joe_
[13:27:16] ack!
[13:28:51] <_joe_> uhm not sure what's going on, but the service didn't come up after a restart
[13:28:56] volans, elukey: just a heads up I am draining traffic on cr2-esams so I can restart a line card to pick up new port speed config
[13:29:43] topranks: okok, maybe let's wait a sec for Joe? I know the two things are unrelated but better safe than sorry
[13:29:45] topranks: ack, but esams is pooled
[13:29:51] ?
[13:29:53] or is it time sensitive?
[13:30:02] elukey: no not time sensitive I will wait
[13:30:12] volans: yeah that is ok cr1 can handle the traffic
[13:30:17] k
[13:30:38] was just to know, not to say it wouldn't :D
[13:40:37] <_joe_> I... don't understand why hiddenparma is not coming back up
[13:41:04] do you need a hand?
[13:41:54] <_joe_> not right now, but I might in a few
[13:42:04] let us know, we can help look
[13:42:05] <_joe_> although I guess i can "just" rollback
[13:42:49] <_joe_> uhm, is it possible something went wrong with the deployment maybe? there's something strange in the logs
[13:44:16] could it just be... hiding?
[13:44:30] (sorry) :)
[13:46:28] <_joe_> yeah, not particularly helpful
[13:52:57] <_joe_> ok, for now I'll rollback. I'm not sure what went wrong here. I need to go a bit deeper into it
[13:53:51] /
[13:54:00] wrong window...
[13:54:09] ack
[14:09:31] heads-up for on-callers present and future - I've changed how pods recover in thumbor, mostly will affect situations where thumbor gets swamped https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1298811
[14:10:10] in effect it means that overwhelmed pods will stay un-ready for longer if they're swamped but will make performance more evenly spread and hopefully less choppy overall
[14:18:08] ack
[14:25:20] I have some changes for netbox hiera popping out in my reimage step
[14:25:37] removal of wikikube-ctrl1005
[14:25:48] new rack and info for dse-k8s-worker1009.mgmt.eqiad.wmnet
[14:26:09] and removal of hosts/wikikube-ctrl2006.yaml
[14:26:30] cc: btullis jayme if you have context
[14:29:08] Thanks. I think that the context may be here: T401441 for dse-k8s-worker1009. The `sre.hosts.provision` cookbook was run against it. That's the only change I know of.
[14:29:08] T401441: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441
[14:29:55] okok good enough
[14:29:58] thanks!
[14:30:56] wikikube should be https://phabricator.wikimedia.org/T418920
[14:31:06] I'll proceed
[14:35:57] elukey: 2006 is https://phabricator.wikimedia.org/T406596 ..but I'm not sure about the status
[14:38:13] oook thanks
[14:38:14] !!
[14:52:29] elukey, volans: I am going to now proceed with the traffic drain on cr2-esams if that's ok?
[14:53:40] wfm
[14:54:02] thanks
[14:57:02] +1
[15:02:24] elukey, volans: it seems the routing daemon crashed on cr2-esams when the bgp depref was issued
[15:02:36] I think things are ok via cr1 but I am going to depool the site just to be cautious
[15:03:44] you didn't run it with --good-luck ?
[15:12:25] volans: no I ran with --over-confidence and it backfired on me :)
[15:15:43] anyway things are back to normal now so I will repool esams again
[15:55:25] In puppet code and our data types we say that only systemd units ending in ".service" or ".timer" are valid units. But there is another type of unit, a .mount. For example since trixie there is "tmp.mount" that mounts tmp in memory. I need to mask that in one case. So I'm going to suggest adding .mount to the regex of valid units.
[15:55:55] seems reasonable
[15:59:52] thx
[16:01:34] other types exist too btw
[16:01:38] like .path
[16:02:49] ack..hmm, considering adding more to the pattern.. or just on demand
[16:03:03] watching meeting on GenAI first
[16:28:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1302205
[16:34:38] <_joe_> I realized we never updated the issue here... the problem we had earlier is something that is O(n^2), which we had deemed "slow but acceptable" but turned out not to be
[16:35:02] <_joe_> mostly because we have 22k distinct ipblocks in production, a number that in retrospect we should've checked and we didn't
[17:50:58] can I get a "just sanity check"-level of +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1302206 please? If I wait it means I have broken puppet on releases hosts or need to leave it disabled.
[17:58:27] brett loves systemd :P
[17:58:27] ^
[17:58:39] :)
[17:59:15] I do like systemd. I don't really care much for the Puppet abstraction we have for it
[17:59:17] I did not invent that pattern. just copied from existing types
[17:59:29] and changed the "file extension"
[18:03:11] We should not have to be managing this kind of work, but +1
[18:04:28] thanks
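A minimal sketch of the unit-name validation discussed above, written in Python rather than Puppet. The real Puppet data type and its regex are not shown in this log, so the character class and the `unit_pattern` and `is_valid_unit` helpers are assumptions made purely for illustration. It shows how extending the list of allowed suffixes makes `tmp.mount` pass while `.path` units stay rejected until they are added on demand.

```python
import re


def unit_pattern(suffixes):
    """Hypothetical helper: compile a pattern accepting names ending in one of the suffixes."""
    alternation = "|".join(re.escape(s) for s in suffixes)
    # Assumed character set for the unit name; the real Puppet type may differ.
    return re.compile(rf"[A-Za-z0-9_.@:-]+\.(?:{alternation})")


def is_valid_unit(pattern, name):
    """Return True only when the whole name matches the pattern."""
    return pattern.fullmatch(name) is not None


old_pattern = unit_pattern(["service", "timer"])
new_pattern = unit_pattern(["service", "timer", "mount"])

print(is_valid_unit(old_pattern, "tmp.mount"))  # False
print(is_valid_unit(new_pattern, "tmp.mount"))  # True
print(is_valid_unit(new_pattern, "foo.path"))   # False
```

Keeping the suffix list as data makes the "add more now or on demand" choice a one-line change either way.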