[02:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:39] 10netops, 06Infrastructure-Foundations, 06Traffic: BGP settings for liberica - https://phabricator.wikimedia.org/T379164#10558605 (10Vgutierrez) 05Open→03Resolved
[09:21:18] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558739 (10elukey) In the cumin2002 logs I see: ` 2025-02-07 11:40:17,558 jmm 2595123 [DEBUG redfish.py:912 in generation] ganeti1033: iD...
[09:56:53] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558854 (10elukey) All right, installed `python3.9-dbg` on cumin2002, ran the cookbook and used `py-bt` to verify where it hangs: ` (g...
[10:10:23] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558896 (10elukey) I confirm that with spicerack-shell I can see the following hanging: ` >>> pprint(r.upload_file(Path("/srv/firmware/po...
[10:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:38] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10558933 (10MoritzMuehlenhoff) >>! In T385873#10558896, @elukey wrote: > I have no idea how long it takes for the BMC to fetch ~200MB of da...
[10:51:29] 10SRE-tools, 06Infrastructure-Foundations, 06SRE: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10559022 (10MoritzMuehlenhoff) Seems I was just too impatient (or unaware how slow it can be for some firmwares), it completed after roughl...
[11:32:28] moritzm: I'm running this command on ganeti1028: gnt-instance info grafana1002.eqiad.wmnet. I need to execute a grow-disk command on grafana1002, but the info command has been stuck for 11 minutes so far. Do you think it is safe to proceed with the grow-disk command? (I know the Grafana machine has only one disk.)
[11:47:35] it's not stuck, it's queued. I'm currently draining a Ganeti node (1023), so when that is completed, the grow-disk command will run
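A rough Python sketch of the "queued, not stuck" situation above: Ganeti serialises jobs per cluster, so a plain `gnt-instance info` simply waits behind a long node drain, and checking the job queue first avoids the confusion. The `-o id,status,summary` field list, the `running` status literal and the `100g` amount are assumptions for illustration, not taken from the log.

```python
import subprocess

def running_ganeti_jobs():
    """Return (id, status, summary) rows for Ganeti jobs still running."""
    out = subprocess.run(
        ["gnt-job", "list", "--no-headers", "-o", "id,status,summary"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split(None, 2) for line in out.splitlines() if line.strip()]
    return [row for row in rows if len(row) > 1 and row[1] == "running"]

busy = running_ganeti_jobs()
if busy:
    # A node drain or other long operation is in flight; new jobs will queue.
    print("Cluster busy, our job would queue behind:", busy)
else:
    # Grow disk 0 of the instance by 100 GiB (amount here is illustrative).
    subprocess.run(
        ["gnt-instance", "grow-disk", "grafana1002.eqiad.wmnet", "0", "100g"],
        check=True,
    )
```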
[11:48:20] I'll ping you when it's drained
[11:48:31] ok, thank you moritzm
[14:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:32:57] topranks: o/ around? I'd need a quick brainbounce but only if you have time, not urgent
[16:34:14] elukey: just in the door from the airport now! but yep fire away I'll try my best :)
[16:35:24] ahahahah nono please have a good flight!
[16:35:54] or have you just returned? Didn't get it
[16:36:02] anywayyy no problem another time
[16:37:10] ah it's ok, I have the laptop open now..... unless it's something super hard :P
[16:37:28] I just returned yes
[16:37:53] volans will be amazed that I managed to fit all the records and they made it back with me unharmed
[16:38:01] aahahah
[16:38:14] anyway, I'll start but please stop anytime
[16:39:37] so kartotherian.svc.eqiad.wmnet:6543 is currently backed by a mixture of bare metal nodes (maps100X) and kubernetes workers
[16:39:38] glad to hear that!
[16:40:00] the idea was to slowly depool all the bare metal ones, but yesterday there was an outage while I was doing so :D
[16:40:09] while reviewing the various config, I noticed this
[16:40:19] https://logstash.wikimedia.org/goto/81bd58d747f495bfd803c404d3e481ee
[16:40:29] the probes seem to fail every now and then
[16:40:34] now get some rest and don't work
[16:40:48] and indeed I see
[16:40:48] elukey@puppetserver1001:~$ curl -k -i https://kartotherian.svc.eqiad.wmnet:6543/osm-intl/6/23/24.png
[16:40:51] curl: (7) Failed to connect to kartotherian.svc.eqiad.wmnet port 6543 after 3 ms: Couldn't connect to server
[16:41:19] but not always, it seems when it hits the k8s pods (I recognize the "good"/successful replies since they have some http response headers that are bare-metal related)
[16:41:37] if I hit the kubernetes nodes directly, all works without any error
[16:41:57] is there anybody that has an idea why this could happen?
[16:42:57] hmmm
[16:43:59] very odd
[16:44:10] are we using multiple lvses with different meds or all the traffic goes to one for this service?
[16:44:13] two consecutive wgets from cumin1002 show the same thing
[16:44:15] one works one does not
[16:44:22] https://www.irccloud.com/pastebin/048KnA0L/
[16:44:58] volans: we have the same IP listed in service.yaml for three LVS endpoints (where the port varies)
[16:45:06] but apart from that, nothing else
[16:45:44] I tried a for loop to hit all backend nodes but all good
[16:46:10] so it seems lvs-related somehow, still not clear why.. I can't explain the no route to host
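A minimal probe loop, sketched in Python, equivalent to the repeated curl/wget attempts above: hit the LVS service URL a number of times and tally which backends answer versus which attempts fail outright. The URL is the one from the log; the `x-served-by` header name is an assumption standing in for whichever backend-identifying response header distinguishes the bare-metal replies from the k8s ones.

```python
import collections
import requests
import urllib3

urllib3.disable_warnings()  # we use verify=False, like `curl -k` above

URL = "https://kartotherian.svc.eqiad.wmnet:6543/osm-intl/6/23/24.png"
outcomes = collections.Counter()

for _ in range(50):
    try:
        r = requests.get(URL, timeout=3, verify=False)
        # Assumed header name; any backend-identifying header would do.
        backend = r.headers.get("x-served-by", "no-backend-header")
        outcomes[f"HTTP {r.status_code} via {backend}"] += 1
    except requests.exceptions.RequestException as exc:
        # Connection refused, no route to host, timeouts, etc.
        outcomes[type(exc).__name__] += 1

for outcome, count in outcomes.most_common():
    print(f"{count:3d}  {outcome}")
```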
[16:49:30] elukey: one thing I noticed on lvs1019 if you do ipvsadm -Ln and check the 10.2.2.13:6543 block
[16:50:17] yeah... in a pcap for the successful connections all looks ok
[16:50:33] for the broken one all I see is a SYN and nothing else, though I might be filtering out something important in the tcpdump
[16:51:13] the InActConn are all 0 for the wikikube instances, while the maps ones have values. But it might be a red herring given how we route things
[16:53:45] huh it's some routing thing seemingly
[16:53:57] TTL exceeded comes back to the cumin host the time it fails
[16:54:15] https://usercontent.irccloud-cdn.com/file/Hct7eNzE/kartho.pcap
[16:54:39] like there was a loop?
[16:54:55] indeed
[16:55:07] but I was testing with MTR and didn't see anything like that reflected
[16:56:37] I am wondering if the mixture bare-metal/k8s plays a role
[16:57:21] https://www.irccloud.com/pastebin/u5VsO9pl/
[16:58:17] I don't think it's the routing to the LVS (which is what the above shows), but perhaps the packet forwarded by the LVS to the real servers (bare metal or K8s)
[16:58:41] How is it being forwarded on the LVS? Is it IPIP tunnel or layer-2 ?
[16:59:35] layer-2
[16:59:41] at least, IIUC
[17:01:53] what wikikube instance would the LVS be trying to send to? do you know what IP/MAC it would use?
[17:02:21] I've set these in conftool:
[17:02:21] wikikube-worker1002.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1003.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1004.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:21] wikikube-worker1005.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:24] wikikube-worker1006.eqiad.wmnet: [kartotherian-k8s-ssl]
[17:02:46] then on every worker kube-proxy would take care of forwarding the request to the worker with the pod running on it
[17:05:30] need to step afk to pick up Alessandro, we can restart tomorrow if you folks have time
[17:05:44] at this point it is probably better for me to totally depool the k8s nodes
[17:05:51] topranks: ok if I do it?
[17:06:00] or are you still testing?
[17:06:19] picking wikikube-worker1002 I can see plenty of egress traffic from lvs1019 to that host
[17:06:22] so it looks ok
[17:06:38] ah okok
[17:06:40] weird then
[17:06:44] elukey: yeah maybe best to depool for now
[17:07:42] done thanks :)
[17:08:15] actually no
[17:08:46] https://phabricator.wikimedia.org/P73486
[17:09:28] ^^ traffic forwarded by lvs on that vlan is all for maps1008, don't see it for wikikube-worker1002
[17:10:04] really weird
[17:10:26] anyway, thanks a ton topranks for the help! If you have time during the next days lemme know :)
[17:10:29] please rest tomorrow!
[17:11:11] (with the k8s hosts depooled I can no longer repro the no route to host via curl, so it is definitely k8s)
[17:11:14] (sigh)
[17:11:26] have a nice rest of the day folks!
[17:11:54] elukey: yep let's think in the morning if we can test it without disrupting the service much
[17:12:41] what might help is just to pool 1 K8s host, and then we can try to see what happens to traffic when it hits it (or if it hits it at all)
[17:12:57] or even ask alex if we get totally stuck, though he is probably busy with APP stuff
[17:49:09] um ... I'll add a comment to the task, but it looks like the LVS IP was never added to the loopback device on the k8s nodes that are kartotherian-k8s-ssl backends
[17:49:49] basically, there was never an "equivalent" of adding the service to `profile::lvs::realserver::pools` for those nodes
[18:04:05] {{done}} https://phabricator.wikimedia.org/T386648#10560543
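Since the root cause was the missing service IP on the realservers' loopback, a quick check along these lines would have surfaced it: with LVS layer-2 (direct routing) forwarding, a realserver only accepts the forwarded packets if the VIP is bound locally, typically on lo. This is only a sketch, not an existing cookbook; the VIP and worker hostnames come from the log above, while ssh access and iproute2 on the targets are assumptions.

```python
import subprocess

VIP = "10.2.2.13"  # kartotherian service IP from the ipvsadm block above
WORKERS = [f"wikikube-worker100{n}.eqiad.wmnet" for n in range(2, 7)]

for host in WORKERS:
    # Look for the VIP among the addresses configured on the loopback device.
    proc = subprocess.run(
        ["ssh", host, "ip", "-brief", "address", "show", "dev", "lo"],
        capture_output=True, text=True, check=False,
    )
    status = "VIP present on lo" if VIP in proc.stdout else "VIP MISSING on lo"
    print(f"{host}: {status}")
```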
[18:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:21] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:48] 10netops, 06Infrastructure-Foundations: cr2-esams:interface ae1 present under protocol ospf but not configure - https://phabricator.wikimedia.org/T386766 (10Papaul) 03NEW