[12:41:24] netops, Infrastructure-Foundations, SRE: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419139 (MoritzMuehlenhoff) Rancid is a bit of a maze of scripts calling each other, but I could eventually track it down to /usr/bin/control_rancid. In our case, the...
[13:17:14] netops, Infrastructure-Foundations, SRE: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419237 (MoritzMuehlenhoff) Also reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121730
[13:25:31] netops, Infrastructure-Foundations, SRE: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419246 (MoritzMuehlenhoff) Open→Resolved a: MoritzMuehlenhoff Updates have been rolled out and diffs are being sent again.
[14:02:10] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11419402 (ayounsi) Open→Resolved a: ayounsi I guess we're good here.
[14:28:20] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11419516 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host cp2043.codfw.wmnet with OS trixie
[14:33:03] Traffic, DNS, serviceops, SRE, Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11419541 (taavi) Open→Resolved a: taavi
[14:38:07] Traffic, SRE, Patch-For-Review: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11419563 (SLyngshede-WMF) Open→Resolved a: SLyngshede-WMF
[15:19:34] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11419755 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host cp2043.codfw.wmnet with OS trixie executed with err...
[15:29:55] netops, Infrastructure-Foundations: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11419801 (ayounsi) p: Triage→Low
[15:31:19] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11419817 (elukey) p: Triage→Medium
[15:38:27] Traffic, SRE: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11419845 (SLyngshede-WMF) @cmooney can you let the people from Meta know that this should be fixed now?
[19:55:05] If anyone is around, I could use a hand with pybal for cloudweb hosts. So far everything I've touched has gotten worse, so I'm reluctant to take the next steps
[19:55:09] w/out supervision
[19:55:24] The primary issue is https://horizon.wikimedia.org/ failing
[19:57:06] sorry taavi, I have to bother you more because restarting pybal on lvs1020 caused the sites to go down (regardless of whether or not that port is in the healthcheck)
[19:57:31] (removing the port being https://gerrit.wikimedia.org/r/c/operations/puppet/+/1213556/1/hieradata/common/service.yaml )
[19:58:20] andrewbogott: so what exactly have you done so far? I see no relevant SAL entries?
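(For context on the health check being discussed: the cloudweb entry in hieradata/common/service.yaml carried an explicit port in its check URL, and the envoy version in trixie answers that request with a 503, which makes the check fail. Below is a minimal, self-contained sketch of such an HTTP probe; the URLs are made up, and plain urllib stands in for PyBal's actual ProxyFetch monitor.)

```python
# Minimal sketch of an HTTP health probe, assuming made-up URLs; this is
# not PyBal's ProxyFetch code, just an illustration of why a 503 on the
# check URL marks a backend as down.
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the backend answers the check URL with a 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError as exc:
        # A 503 (what the trixie envoy returned for the URL with the port) ends up here.
        print(f"check failed: {url} -> HTTP {exc.code}")
        return False
    except OSError as exc:
        print(f"check failed: {url} -> {exc}")
        return False


if __name__ == "__main__":
    # In the incident above, the URL with the explicit port drew a 503 while
    # the plain form was answered normally (hostnames here are fake).
    for url in ("https://cloudweb.example.org:443/", "https://cloudweb.example.org/"):
        print(url, "->", "up" if check_health(url) else "down")
```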
[19:59:23] taavi: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1213550, restarted pybal on lvs1020, reverted that patch, restarted again
[19:59:27] that's it
[19:59:42] Oh, and pooled 1003
[19:59:57] andrewbogott: about to head into a meeting and can look in a bit.
[20:00:18] so removing the healthcheck is failing and removing the port makes it work?
[20:00:59] sukhe: I am pretty sure that restart is what broke it
[20:01:08] no
[20:01:11] since all I did was change the healthcheck and then change it back; should have no effect
[20:01:12] restart of pybal you mean?
[20:01:15] yes
[20:02:00] So I suspect it was in a broken-but-accidentally-working state before I touched things
[20:02:42] so the cloudweb service was configured with an explicit port in the health check url, and the envoy version in trixie gives a 503 with that (and a behaviour change in the envoy version in trixie is my guess why that's only broken now)
[20:02:57] the service was up before the restart thanks to pybal's depool threshold
[20:03:08] yeah that makes sense so far
[20:04:07] (in a meeting so can step in if required)
[20:04:16] when I restarted pybal on lvs1020 did that flip traffic over to a different lvs server?
[20:04:30] now I'm running puppet on the secondary (1019) and running the pybal reboot cookbook to pick up the config change to remove the port
[20:04:40] (if so then puppet+restart everywhere might be all we need)
[20:05:25] s/1019/1020/
[20:06:10] ok, so far you are doing the same things I did...
[20:10:14] ...except apparently it works when you do it?
[20:10:24] yep
[20:10:44] grrrrrrrr
[20:10:51] ok, well, thank you for the rescue :)
[20:10:52] lemme see the logs if there's a clue
[20:11:23] I suspect that the difference is restarting pybal in one place and not both places (although I don't understand why it didn't just keep working but complaining on 1019 after I restarted 1020, which is what I was expecting)
[20:15:31] andrewbogott: so you restarted pybal at 19:39Z to remove the port, and added it back at 19:48Z, and according to the pybal logs that was very successful: https://phabricator.wikimedia.org/P86257
[20:16:20] yeah
[20:16:21] ... but lvs1020 is the secondary one, so that had no impact on user traffic. instead 1019 is serving traffic, and it never had the restart done
[20:16:46] it definitely had an effect, because the sites loaded before and didn't load after
[20:22:58] ...and now I have to run to a volunteer shift; I'll leave the laptop behind so the sites will be safe from my bad influence
[20:23:09] andrewbogott: it went down at which time exactly? something which didn't log to SAL seems to have depooled one of the servers at 19:40Z, which is my best guess of the cause of that
[20:31:19] It went down as soon as I was able to check after the restart.
[20:31:43] 1003 reimage might have gotten lucky and done something at exactly that moment
[20:33:08] did we stop pybal and disable puppet at any point?
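(The depool threshold mentioned here is what kept cloudweb serving while its health check was failing: pybal refuses to depool below a configured fraction of the pool. The sketch below is a rough, self-contained illustration of that idea only; the Server class, the select_pooled helper, and the threshold value are made up and are not PyBal's implementation.)

```python
# Self-contained sketch of a depool-threshold rule: even when health checks
# fail, keep at least a configured fraction of the enabled pool in service.
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    enabled: bool = True   # administratively pooled (e.g. via `pool`/`depool`)
    up: bool = True        # health checks passing


def select_pooled(servers: list[Server], depool_threshold: float = 0.5) -> list[Server]:
    """Return the servers that should keep receiving traffic."""
    candidates = [s for s in servers if s.enabled]
    healthy = [s for s in candidates if s.up]
    minimum = max(1, int(len(candidates) * depool_threshold))
    if len(healthy) >= minimum:
        return healthy
    # Not enough healthy servers: keep some unhealthy-but-enabled ones pooled
    # rather than dropping below the threshold.
    return (healthy + [s for s in candidates if not s.up])[:minimum]


if __name__ == "__main__":
    # Hypothetical scenario: both backends fail the check, so the threshold
    # keeps one pooled anyway; a manual depool then narrows the choice.
    pool = [Server("cloudweb1003", up=False), Server("cloudweb1004", up=False)]
    print([s.name for s in select_pooled(pool)])   # ['cloudweb1003']
    pool[0].enabled = False                        # manual `depool` of 1003
    print([s.name for s in select_pooled(pool)])   # ['cloudweb1004']
```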
[20:33:28] stopping pybal is equivalent to depooling a host
[20:33:36] not to my knowledge
[20:33:48] https://grafana.wikimedia.org/goto/eDFVbAZDR?orgId=1
[20:34:26] this would then also explain
[20:34:27] > it definitely had an effect, because the sites loaded before and didn't load after
[20:34:37] where 1020 was handling low-traffic, when it should not be
[20:35:21] so lvs1020 clearly at some point was the active low-traffic host
[20:35:38] and if we did not stop pybal then we should look into what happened here
[20:35:51] (stepping out for a bit and will wait for a confirmation and then file a task depending on that)
[20:37:21] sukhe: I think those spikes match with https://sal.toolforge.org/log/Tx0S25oBvg159pQrPrBD and https://sal.toolforge.org/log/A6GI25oBffdvpiTr5Yji which were before or after anything andrew did
[20:40:38] on cloudweb1003 I see andrewbogott ran `depool` at 19:40Z. I don't see any other explanation for the breakage andrew saw around that time, even though according to the Icinga alerts cloudweb1004 was the one pooled thanks to the depool threshold at that time on lvs1019
[20:41:46] the pybal log says 'Merged [enabled|disabled] server' and 'Initialization complete' at that point after it sees the pool change in etcd, not sure if it would log any actual ipvs changes it'd done at that point?
[21:17:18] taavi: yeah but a restart of pybal shouldn't trigger what we are seeing in the connections. I don't have a good explanation for why we see a rise in both 1019 and 1020 though.
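(The 'Merged [enabled|disabled] server' and 'Initialization complete' lines are what pybal logs once it sees a pool change from etcd. Below is a rough, self-contained sketch of that merge step, assuming pool state arrives as a name→enabled mapping; merge_pool_state is a hypothetical helper, not pybal's code, and like the real log lines it says nothing about any ipvs changes.)

```python
# Hedged sketch of merging a pool-state update from the config backend
# (etcd in production) into an in-memory view, logging enable/disable
# transitions in the spirit of the messages quoted above.
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pool-merge-sketch")


def merge_pool_state(current: dict[str, bool], update: dict[str, bool]) -> dict[str, bool]:
    """Merge a name -> enabled mapping into the current view and log changes."""
    merged = dict(current)
    for name, enabled in update.items():
        if merged.get(name) != enabled:
            log.info("Merged %s server %s", "enabled" if enabled else "disabled", name)
        merged[name] = enabled
    log.info("Initialization complete")
    return merged


if __name__ == "__main__":
    # Hypothetical starting state; a manual `depool` of cloudweb1003 shows
    # up in the update as enabled=False.
    current = {"cloudweb1003": True, "cloudweb1004": True}
    merge_pool_state(current, {"cloudweb1003": False, "cloudweb1004": True})
```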