[07:56:36] I am sending 30 TB over the eqiad-codfw cross link FYI over the next days
[08:13:16] traffic is happening over TLS from backup1013 -> backup2003 and from backup2013 -> backup1003
[08:18:10] jynus: thanks for the heads-up. Is it going to use the full 10G of the host?
[08:21:11] current rate is 250 Mbytes/s
[08:22:18] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=2025-06-25T07:21:59.227Z&to=2025-06-25T08:21:59.227Z&timezone=utc&var-server=backup1013&var-datasource=000000026&var-cluster=backup&refresh=5m&viewPanel=panel-8
[08:23:33] jayme: are you actively throttling it to 2gbps?
[08:23:38] urgh.. damn nick completion
[08:23:40] jynus: :)
[08:24:14] 👋 :)
[08:24:26] jynus: BTW https://gerrit.wikimedia.org/r/c/operations/alerts/+/1163698
[08:25:00] 👍
[08:59:29] I feel like CI is super slow these days. It takes ~9 min for the spicerack repo, 6 min for the cookbook repo
[09:01:03] hmm, looking at older CRs, it seems like it has always been like that
[09:52:00] Time is just wibbly wobbly and hides it from you
[09:53:22] <_joe_> XioNoX: I wonder why it might be so slow 🤔
[11:43:17] it's because it's orbiting a black hole
[12:34:07] any objection if I try some re-imaging on the sretest2001 and sretest1001 servers? nobody using them?
[12:34:47] not sretest1001
[12:34:58] it's the trixie test host and there are some WIP things on it
[12:35:55] ok, sretest1002 then?
[12:59:27] fine with me!
[13:00:26] when trixie is released, we should also decom 1001 for good, it's over seven years old
[13:03:52] swfrench-wmf: I'm around, so whenever you want :D
[13:05:21] o/
[13:05:48] vgutierrez: just sat down at the computer and will start getting set up :) thank you!
[13:22:49] oncallers, FYI: I'm soon going to start migrating the nginx-based proxy in front of the "main" etcd cluster (i.e., conftool and friends) to cfssl-based PKI.
[13:22:49] this will start with one host in codfw, then expand across all hosts there, eventually moving on to eqiad (though that might wait for another day, depending on speed).
[13:22:49] I'll be watching / verifying a _lot_ of things as I go, but intend to be very loud about what I'm doing in case there are surprises :)
[13:23:15] gl swfrench-wmf!
[13:23:34] dogspeed!
[13:24:13] thank you :)
[13:36:12] migration is starting now on conf2006
[13:44:00] update: conf2006 is looking good. the change applied cleanly, access logs look right, and manual testing with curl, confctl, and etcdctl works
[13:44:25] \o/
[13:45:42] \o/
[13:47:13] the one gotcha is that nginx may(?) have closed connections during the migration itself, which I didn't expect (since a "normal" certificate rotation should be hitless).
[13:47:47] wondering if it might be the result of the change in TLS certificate _paths_ (i.e., not just content).
[13:50:13] certificate rotation is hitless by default?
[13:50:23] that's new :D
[13:54:07] hitless in the sense that all worker threads using the old cert will complete and the master thread will serve new connections with the new one, right?
[13:54:24] that's what I mean by hitless, yes
[13:54:26] swfrench-wmf: so liberica on the DCs that I expect to connect to confd@codfw experienced a bump at 13:37
[13:55:32] vgutierrez: yes, that would be the same effect - if indeed nginx closed the connections, then that would terminate the watches
[13:55:52] so the connection ending didn't trigger any issue on its own
[13:56:14] but...
[13:56:24] Jun 25 13:37:52 lvs4009 libericad[433060]: time=2025-06-25T13:37:52.773Z level=ERROR msg="unable to watch" key=/conftool/v1/pools/ulsfo/cache_upload/cdn error="401: The event in requested index is outdated and cleared (the requested history has been cleared [5966322/5964425]) [5967321]"
[13:56:47] that's an unrecoverable error
[13:57:03] (from etcd's perspective)
[13:57:04] oh, that's an interesting bug! yeah, it sounds like it's not doing index recovery?
[13:57:26] so liberica solves that by respawning the impacted control planes
[13:57:28] i.e., the protocol for "catching up" when your previous watch index slips out of the 1k window
[13:57:38] that's a solid solution :)
[13:58:01] now I'm wondering how it looks on codfw, where we still have pybal running
[13:58:16] so, I've not touched the pybal host yet :)
[13:58:30] oh right, pybal is smart/dumb enough to use only one host
[13:59:18] going back, though: do you know for certain whether a cert reload (not quite what we're doing here, since we're changing the cert *paths* as well) is hit-ful for existing connections?
[13:59:36] if so, that's probably not going to work for at least one etcd client (etcdmirror)
[13:59:55] that depends on the TLS terminator
[14:00:02] for haproxy it would be a hit-ful event
[14:00:21] so you need to kill the connections using the old cert at some point
[14:02:25] oh, wait ... I see what happened
[14:02:27] https://phabricator.wikimedia.org/P78683
[14:02:31] * swfrench-wmf facepalms
[14:03:28] :)
[14:03:36] nginx dying definitely explains it ;P
[14:04:51] * swfrench-wmf starts reading through `cfssl::cert`
[14:10:17] vgutierrez: about the haproxy thing, haproxy does consider updating certs through the runtime API ("set ssl cert") to be hitless.
[14:10:41] sukhe: definitely not for existing connections
[14:10:42] old connections use the old certs till the connection closes and then pick up the new certs. or am I mistaken?
[14:10:55] right.. that's a problem for etcd connections
[14:11:05] where a watch operation can last for days, weeks or months
[14:11:42] alright, so, I see what happened ... nginx is loading the chained cert, and it seems like there's a race here on the initial migration - i.e., the exec notifies when the cert is ready, _but_ only later is the chained version of the cert ready
[14:11:55] (and we get a second notify, which "fixes" things)
[14:12:21] so, the nginx *restart* is a one-time artifact of the migration, due to raciness in the notifications
[14:13:52] ack
[14:14:48] also, I've confirmed again that the nginx docs seem to indicate that a config reload triggers worker processes to start shutting down (stop accepting new conns), but does not make them terminate existing ones
[14:15:01] (i.e., workers stick around until existing conns go away)
[14:15:55] (i.e., using the previous certs)
[14:18:03] vgutierrez: so, given what we've learned - i.e., that there's a notification race on initial conversion, which can trigger an nginx restart - how do you feel about proceeding?
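(For context on the "catching up" protocol mentioned at 13:57: when a watcher's saved index falls out of etcd's 1000-event history window, the v2 API answers with error 401, EcodeEventIndexCleared, and the client must re-read the key to learn the current X-Etcd-Index before watching again, rather than retrying with the stale index. The sketch below illustrates that recovery path using the etcd v2 Go client; it is not liberica's or confd's actual code, the endpoint and port are assumptions, and the key is simply copied from the log line above as an example.)

```go
// Minimal sketch of index-cleared recovery for a long-lived etcd v2 watch.
// Not the real liberica/confd implementation; endpoint/port are assumed.
package main

import (
	"context"
	"log"
	"time"

	etcd "go.etcd.io/etcd/client/v2"
)

func watchWithCatchUp(ctx context.Context, kapi etcd.KeysAPI, key string) {
	// Seed the watch index from the cluster's current index (X-Etcd-Index).
	resp, err := kapi.Get(ctx, key, &etcd.GetOptions{Recursive: true, Quorum: true})
	if err != nil {
		log.Fatalf("initial get failed: %v", err)
	}
	afterIndex := resp.Index

	for {
		w := kapi.Watcher(key, &etcd.WatcherOptions{AfterIndex: afterIndex, Recursive: true})
		for {
			ev, err := w.Next(ctx)
			if err != nil {
				if cerr, ok := err.(etcd.Error); ok && cerr.Code == etcd.ErrorCodeEventIndexCleared {
					// The saved index fell out of the history window:
					// re-sync from a fresh read instead of retrying
					// with the stale index.
					if cur, gerr := kapi.Get(ctx, key, &etcd.GetOptions{Recursive: true, Quorum: true}); gerr == nil {
						afterIndex = cur.Index
					}
				} else {
					log.Printf("watch error: %v", err)
				}
				time.Sleep(time.Second)
				break // recreate the watcher with the (possibly updated) index
			}
			afterIndex = ev.Node.ModifiedIndex
			log.Printf("event %s on %s", ev.Action, ev.Node.Key)
		}
	}
}

func main() {
	c, err := etcd.New(etcd.Config{Endpoints: []string{"https://conf2006.codfw.wmnet:4001"}})
	if err != nil {
		log.Fatal(err)
	}
	watchWithCatchUp(context.Background(), etcd.NewKeysAPI(c), "/conftool/v1/pools/ulsfo/cache_upload/cdn")
}
```

(Respawning the whole control plane, as liberica does per 13:57, reaches the same end state with less bookkeeping, at the cost of briefly rebuilding all watches.)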
[14:18:03] I can revert conf2006 pretty easily and try to improve the puppetization for another day
[14:18:25] for liberica it isn't going to be a problem
[14:18:44] for pybal we will need to restart the 3 instances in codfw
[14:18:48] s/3/4/
[14:18:59] not a huge problem IMHO
[14:19:07] yeah, I was planning to restart them anyway to confirm everything comes back as expected
[14:19:31] the only problem is that the 4 of them will fail at the same time
[14:19:40] indeed :)
[14:19:43] unless you switch the hosts to different confd instances first
[14:20:06] I should probably know this, but: is there a way to temporarily mask delivery of reloads to a systemd unit?
[14:20:37] if I can do that somehow, I can prevent nginx from observing the race :)
[14:21:38] that's above my systemd kung-fu
[14:25:07] I'm not aware of such a feature
[14:25:28] alas, yeah - I can't find anything like that in the docs
[14:27:34] alright, checking the time, I need to make a decision on whether to proceed or not
[14:27:34] I think the right call is to revert conf2006 and give some thought to whether the puppetization can be trivially improved to address this
[14:27:34] the restarts are "fine" in that we need to be able to accommodate connection shutdown anyway, but it's not quite what I prepped for today :)
[14:28:10] (i.e., it triggers a different set of possible failure modes for certain clients, which changes my strategy a bit)
[14:34:19] <_joe_> swfrench-wmf: please proceed without fear
[14:34:42] <_joe_> the worst that you might need to do is restart confd fleet-wide after you're done
[14:34:44] <_joe_> and pybal
[14:40:27] thanks, _joe_ - yeah, this should all be doable, and in fact, we'll have to do exactly this for the etcd-side part of this, since that will indeed shut down connections
[14:42:07] I decided to revert for the moment, mainly given the hour and to give myself a chance to collect my notes on what I was going to do for that case :)
[14:44:29] alright, cleanup:
[14:44:29] * I'm going to restart codfw-associated confds, given that their clients may now be ignoring conf2006 due to the connection drops (an issue we've seen previously)
[14:44:29] * vgutierrez: do you think liberica cp daemons need a similar treatment, given that they use the same underlying client?
[14:45:00] hmm, let me check
[14:45:19] but I don't think that's true for liberica
[14:46:19] I can confirm that lvs5004, 5006, 4010 and 4008 currently have connections established with conf2006
[14:46:34] given how liberica reacted, it spawned new instances of the etcd client
[14:46:45] so no shared state
[14:47:33] great, yeah - given how it handles watch termination, I _think_ this shouldn't be a problem
[14:48:34] it attempts to retry on watch termination btw, but etcd said it couldn't be retried
[14:49:47] ah, *that's* the point at which you saw the cleared-index error, got it
[15:00:15] alright, final update on my end for this (oncallers): the initial pilot on conf2006 totally seems to have worked, but has been reverted due to a surprising transient behavior. all affected confds have been restarted and everything should be back to normal :)
[15:43:58] topranks: please let me know once you're done with the puppet merge run
[15:44:20] Amir1: no problem, are the changes to tables-catalog.yaml yours?
[15:44:48] yup
[15:45:12] ok, I will go ahead and merge them then?
[15:45:36] sure
[15:45:38] thanks
[15:46:41] Amir1: ok, done
[15:46:55] Thanks
[18:19:45] swfrench-wmf: there is a swift patch waiting for merge
[18:20:27] Amir1: yup, just merged that - feel free to go ahead with it since you appear to have beaten me to the lock :)
[18:20:27] should I merge?
[18:20:34] :D
[18:20:39] thank you!
[18:20:46] done
[18:21:04] thank you for merging after I'm done with my oncall shift
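(Footnote on the cfssl::cert notification race discussed at 14:11-14:12: one generic guard, sketched below under assumed file paths, is to refuse to signal nginx until the chained certificate and the private key on disk are mutually consistent, so a notify that fires before the chained file has been rebuilt cannot reload nginx with a half-migrated pair. This is an illustration only, not what the actual puppetization does.)

```go
// Hedged sketch of a pre-reload consistency check for nginx TLS material.
// Not the real cfssl::cert behavior; both file paths are hypothetical.
package main

import (
	"crypto/tls"
	"log"
	"os/exec"
)

func reloadNginxIfConsistent(chainedCertPath, keyPath string) error {
	// LoadX509KeyPair parses the chained PEM file and verifies that its leaf
	// certificate matches the private key; it fails on a missing, truncated,
	// or stale chained file, which is exactly the racy state to avoid.
	if _, err := tls.LoadX509KeyPair(chainedCertPath, keyPath); err != nil {
		return err
	}
	// A plain reload lets existing nginx workers drain with the old cert
	// (per the nginx docs discussion at 14:14) rather than killing them.
	return exec.Command("systemctl", "reload", "nginx").Run()
}

func main() {
	if err := reloadNginxIfConsistent(
		"/etc/nginx/ssl/etcd.chained.pem", // hypothetical path
		"/etc/nginx/ssl/etcd.key",         // hypothetical path
	); err != nil {
		log.Fatalf("not reloading nginx: %v", err)
	}
}
```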