[02:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:39] 10netops, 06Infrastructure-Foundations, 06Traffic: BGP settings for liberica - https://phabricator.wikimedia.org/T379164#10558605 (10Vgutierrez) 05Open→03Resolved
[09:21:18] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558739 (10elukey) In the cumin2002 logs I see: ` 2025-02-07 11:40:17,558 jmm 2595123 [DEBUG redfish.py:912 in generation] ganeti1033: iD...
[09:56:53] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558854 (10elukey) All right, installed `python3.9-dbg` on cumin2002, ran the cookbook and used `py-bt` to verify where it hangs: ` (g...
[10:10:23] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558896 (10elukey) I confirm that with spicerack-shell I can see the following hanging: ` >>> pprint(r.upload_file(Path("/srv/firmware/po...
[10:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:38] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558933 (10MoritzMuehlenhoff) >>! In T385873#10558896, @elukey wrote: > I have no idea how long it takes for the BMC to fetch ~200MB of da...
[10:51:29] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10559022 (10MoritzMuehlenhoff) Seems I was just too impatient (or unaware how slow it can be for some firmwares), it completed after roughl...
[11:32:28] moritzm: I'm running this command on ganeti1028: gnt-instance info grafana1002.eqiad.wmnet. I need to execute a grow-disk command on grafana1002, but the info command has been stuck for 11 minutes so far. Do you think it is safe to proceed with the grow-disk command? (I know the Grafana machine has only one disk.)
[11:47:35] it's not stuck, it's queued. I'm currently draining a Ganeti node (1023), so when that is completed, the grow-disk command will run
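A rough Python sketch of the "queued, not stuck" situation above: Ganeti serialises jobs per cluster, so a plain `gnt-instance info` simply waits behind a long node drain, and checking the job queue first avoids the confusion. The `-o id,status,summary` field list, the `running` status literal and the `100g` amount are assumptions for illustration, not taken from the log.

```python
import subprocess

def running_ganeti_jobs():
    """Return (id, status, summary) rows for Ganeti jobs still running."""
    out = subprocess.run(
        ["gnt-job", "list", "--no-headers", "-o", "id,status,summary"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split(None, 2) for line in out.splitlines() if line.strip()]
    return [row for row in rows if len(row) > 1 and row[1] == "running"]

busy = running_ganeti_jobs()
if busy:
    # A node drain or other long operation is in flight; new jobs will queue.
    print("Cluster busy, our job would queue behind:", busy)
else:
    # Grow disk 0 of the instance by 100 GiB (amount here is illustrative).
    subprocess.run(
        ["gnt-instance", "grow-disk", "grafana1002.eqiad.wmnet", "0", "100g"],
        check=True,
    )
```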
[11:48:20] I'll ping you when it's drained
[11:48:31] ok, thank you moritzm
[14:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:32:57] topranks: o/ around? I'd need a quick brainbounce but only if you have time, not urgent
[16:34:14] elukey: just in the door from the airport now! but yep fire away I'll try my best :)
[16:35:24] ahahahah nono please have a good flight!
[16:35:54] or have you just returned? Didn't get it
[16:36:02] anywayyy no problem another time
[16:37:10] ah it's ok, I have the laptop open now..... unless it's something super hard :P
[16:37:28] I just returned yes
[16:37:53] volans will be amazed that I managed to fit all the records and they made it back with me unharmed
[16:38:01] aahahah
[16:38:14] anyway, I'll start but please stop anytime
[16:39:37] so kartotherian.svc.eqiad.wmnet:6543 is currently backed by a mixture of bare metal nodes (maps100X) and kubernetes workers
[16:39:38] glad to hear that!
[16:40:00] the idea was to slowly depool all the bare metal ones, but yesterday there was an outage while I was doing so :D
[16:40:09] while reviewing the various config, I noticed this
[16:40:19] https://logstash.wikimedia.org/goto/81bd58d747f495bfd803c404d3e481ee
[16:40:29] the probes seem to fail every now and then
[16:40:34] now get some rest and don't work
[16:40:48] and indeed I see
[16:40:48] elukey@puppetserver1001:~$ curl -k -i https://kartotherian.svc.eqiad.wmnet:6543/osm-intl/6/23/24.png
[16:40:51] curl: (7) Failed to connect to kartotherian.svc.eqiad.wmnet port 6543 after 3 ms: Couldn't connect to server
[16:41:19] but not always, it seems when it hits the k8s pods (I recognize the "good"/successful replies since they have some http response headers that are bare-metal related)
[16:41:37] if I hit the kubernetes nodes directly, all works without any error
[16:41:57] is there anybody that has an idea why this could happen?
[16:42:57] hmmm
[16:43:59] very odd
[16:44:10] are we using multiple lvses with different meds or all the traffic goes to one for this service?
[16:44:13] two consecutive wgets from cumin1002 show the same thing
[16:44:15] one works one does not
[16:44:22] https://www.irccloud.com/pastebin/048KnA0L/
[16:44:58] volans: we have the same IP listed in service.yaml for three LVS endpoints (where the port varies)
[16:45:06] but apart from that, nothing else
[16:45:44] I tried a for loop to hit all backend nodes but all good
[16:46:10] so it seems lvs-related somehow, still not clear why.. I can't explain the no route to host
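A minimal probe loop, sketched in Python, equivalent to the repeated curl/wget attempts above: hit the LVS service URL a number of times and tally which backends answer versus which attempts fail outright. The URL is the one from the log; the `x-served-by` header name is an assumption standing in for whichever backend-identifying response header distinguishes the bare-metal replies from the k8s ones.

```python
import collections
import requests
import urllib3

urllib3.disable_warnings()  # we use verify=False, like `curl -k` above

URL = "https://kartotherian.svc.eqiad.wmnet:6543/osm-intl/6/23/24.png"
outcomes = collections.Counter()

for _ in range(50):
    try:
        r = requests.get(URL, timeout=3, verify=False)
        # Assumed header name; any backend-identifying header would do.
        backend = r.headers.get("x-served-by", "no-backend-header")
        outcomes[f"HTTP {r.status_code} via {backend}"] += 1
    except requests.exceptions.RequestException as exc:
        # Connection refused, no route to host, timeouts, etc.
        outcomes[type(exc).__name__] += 1

for outcome, count in outcomes.most_common():
    print(f"{count:3d}  {outcome}")
```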
[16:49:30] elukey: one thing I noticed on lvs1019 if you do ipvsadm -Ln and check the 10.2.2.13:6543 block
[16:50:17] yeah... in a pcap for the successful connections all looks ok
[16:50:33] for the broken one all I see is a SYN and nothing else, though I might be filtering out something important in the tcpdump
[16:51:13] the InActConn are all 0 for the wikikube instances, while the maps ones have values. But it might be a red herring given how we route things
[16:53:45] huh it's some routing thing seemingly
[16:53:57] TTL exceeded comes back to the cumin host the time it fails
[16:54:15] https://usercontent.irccloud-cdn.com/file/Hct7eNzE/kartho.pcap
[16:54:39] like there was a loop?
[16:54:55] indeed
[16:55:07] but I was testing with MTR and didn't see anything like that reflected
[16:56:37] I am wondering if the mixture bare-metal/k8s plays a role
[16:57:21] https://www.irccloud.com/pastebin/u5VsO9pl/
[16:58:17] I don't think it's the routing to the LVS (which is what the above shows), but perhaps the packet forwarded by the LVS to the real servers (bare metal or K8s)
[16:58:41] How is it being forwarded on the LVS? Is it IPIP tunnel or layer-2 ?
[16:59:35] layer-2
[16:59:41] at least, IIUC
[17:01:53] what wikikube instance would the LVS be trying to send to? do you know what IP/MAC it would use?
[17:02:21] I've set these in conftool:
[17:02:21] wikikube-worker1002.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1003.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1004.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1005.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:24] wikikube-worker1006.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:46] then on every worker kube-proxy would take care of forwarding the request to the worker with the pod running on it
[17:05:30] need to step afk to pick up Alessandro, we can restart tomorrow if you folks have time
[17:05:44] at this point it is probably better for me to totally depool the k8s nodes
[17:05:51] topranks: ok if I do it?
[17:06:00] or are you still testing?
[17:06:19] picking wikikube-worker1002 I can see plenty of egress traffic from lvs1019 to that host
[17:06:22] so it looks ok
[17:06:38] ah okok
[17:06:40] weird then
[17:06:44] elukey: yeah maybe best to depool for now
[17:07:42] done thanks :)
[17:08:15] actually no
[17:08:46] https://phabricator.wikimedia.org/P73486
[17:09:28] ^^ traffic forwarded by lvs on that vlan is all for maps1008, don't see it for wikikube-worker1002
[17:10:04] really weird
[17:10:26] anyway, thanks a ton topranks for the help! If you have time during the next days lemme know :)
[17:10:29] please rest tomorrow!
[17:11:11] (with the k8s hosts depooled I can no longer repro the no route to host via curl, so it is definitely k8s)
[17:11:14] (sigh)
[17:11:26] have a nice rest of the day folks!
[17:11:54] elukey: yep let's think in the morning if we can test it without disrupting the service much
[17:12:41] what might help is just to pool 1 K8s host, and then we can try to see what happens to traffic when it hits it (or if it hits it at all)
[17:12:57] or even ask alex if we get totally stuck, though he is probably busy with APP stuff
[17:49:09] um ... I'll add a comment to the task, but it looks like the LVS IP was never added to the loopback device on the k8s nodes that are kartotherian-k8s-ssl backends
[17:49:49] basically, there was never an "equivalent" of adding the service to `profile::lvs::realserver::pools` for those nodes
[18:04:05] {{done}} https://phabricator.wikimedia.org/T386648#10560543
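Since the root cause was the missing service IP on the realservers' loopback, a quick check along these lines would have surfaced it: with LVS layer-2 (direct routing) forwarding, a realserver only accepts the forwarded packets if the VIP is bound locally, typically on lo. This is only a sketch, not an existing cookbook; the VIP and worker hostnames come from the log above, while ssh access and iproute2 on the targets are assumptions.

```python
import subprocess

VIP = "10.2.2.13"  # kartotherian service IP from the ipvsadm block above
WORKERS = [f"wikikube-worker100{n}.eqiad.wmnet" for n in range(2, 7)]

for host in WORKERS:
    # Look for the VIP among the addresses configured on the loopback device.
    proc = subprocess.run(
        ["ssh", host, "ip", "-brief", "address", "show", "dev", "lo"],
        capture_output=True, text=True, check=False,
    )
    status = "VIP present on lo" if VIP in proc.stdout else "VIP MISSING on lo"
    print(f"{host}: {status}")
```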
[18:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:48] 10netops, 06Infrastructure-Foundations: cr2-esams:interface ae1 present under protocol ospf but not configure - https://phabricator.wikimedia.org/T386766 (10Papaul) 03NEW