[00:29:32] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:32:21] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:21] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:58:38] hello folks!
[07:59:04] I think that Scott is totally right, I focused on the loopback IP for bare metals and not k8s workers
[08:07:19] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120893
[08:07:22] * elukey stupid
[08:30:21] also quick one for the etcd-backup alerts https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120899
[08:30:52] I think that we should remove the timer until we fix the issue (see related task), and also we are the only ones doing backups for k8s
[08:31:06] not entirely sure if it makes sense or not, maybe something to bring up to the k8s sig
[08:58:28] +1d
[09:02:37] thanks :) merged and cleaned up
[09:04:32] RESOLVED: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:25] swfrench-wmf: you got it, I don't see the issue anymore now that there is the loopback ip
[09:11:28] cc: topranks: --%
[09:11:36] --^ :)
[09:11:45] so yes Luca's fault as root cause
[09:11:45] sigh
[09:12:28] now I also see some proper traffic in grafana for the k8s pods
[09:15:43] \o/
[09:33:28] elukey: nice!
[09:34:21] I figured the k8s host was routing the packet back to the CR for some reason - and thus it was being sent back again to the LVS - and looping around
[09:34:31] Makes sense nice work!
[09:35:02] topranks: well not really nice since I misconfigured it, Scott was the one that got it :D Sorry for the debug session!
[09:35:39] I would have thought on the K8s host it routed that to a pod though, rather than it being in the loopback?
[09:35:52] that’s my lack of k8s knowledge
[09:37:08] topranks: IIUC if we don't use LVS (l2 at least) the routing never really touches the loopback, but in this case every k8s worker exposes the service port (a nodePort setting) so it is the same as another bare metal server :)
[09:37:30] and then kube-proxy routes the request to the worker with the pod on it
[09:37:34] so you can hit any
[09:37:39] Ah ok nodeport yep seen that
[09:38:12] here is a question - can the k8s nodes do IPIP?
[09:40:36] IIUC it is being worked on, we need to clamp MSS as well
[09:40:42] ah yes https://phabricator.wikimedia.org/T352956
[09:50:19] ah yeah was speaking to Alex about it
[09:50:58] the mss clamping is needed because of the extra headers in the IPIP encapsulation right?
[09:51:44] so when the egress traffic is sent back to the load balancer there is no risk of fragmentation issues etc.
[10:03:54] elukey: yeah basically the headache is the extra 20 byte header the IPIP adds when wrapping up a client's packet between LVS and realserver
[10:04:43] one way to deal with it is to “clamp” or re-write the MSS field of TCP packets, which will prevent the other side sending packets bigger than N bytes
[10:05:19] and thus you can ensure you’ll have the extra headroom to add an extra IPIP header without exceeding MTU
[10:06:25] I think for the k8s host the plan is actually to do something else though. With lower MTU in K8s pod but still higher MTU on the host. I commented on the task. Quite ingenious and elegant.
[10:11:03] ahhh right nice
[10:31:49] and it's not a hack, it's made for that and it's a calico option :)
[10:51:37] netops, Infrastructure-Foundations: cr2-esams:interface ae1 present under protocol ospf but not configure - https://phabricator.wikimedia.org/T386766#10562845 (ayounsi) a:Papaul Thanks! You can remove the now obsolete references from the `ospf` section in https://github.com/wikimedia/operations-homer...
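Note on the MSS clamping discussion above (10:03 to 10:06): the arithmetic can be sketched as follows. This is a minimal illustration, not the configuration used on the hosts. It assumes IPv4 with no IP or TCP options and a 1500-byte MTU in the example call; the function names are hypothetical and only the 20-byte IPIP overhead comes from the log.

```python
# Assumed header sizes (IPv4 and TCP without options).
IPV4_HEADER = 20    # bytes
TCP_HEADER = 20     # bytes
IPIP_OVERHEAD = 20  # bytes, the extra outer IPv4 header mentioned in the log


def clamped_mss(mtu: int, encap_overhead: int = IPIP_OVERHEAD) -> int:
    """Largest TCP MSS whose segments still fit in `mtu` after encapsulation."""
    mss = mtu - encap_overhead - IPV4_HEADER - TCP_HEADER
    if mss <= 0:
        raise ValueError("MTU too small for the assumed headers")
    return mss


def fits_after_encap(payload_len: int, mtu: int,
                     encap_overhead: int = IPIP_OVERHEAD) -> bool:
    """True if a TCP payload of `payload_len` bytes fits in `mtu` once wrapped."""
    total = payload_len + TCP_HEADER + IPV4_HEADER + encap_overhead
    return total <= mtu


if __name__ == "__main__":
    # Illustrative MTU value (assumption, not taken from the log).
    print(clamped_mss(1500))
    print(fits_after_encap(1460, 1500))
```

With an unclamped MSS of MTU minus 40, a full-size segment no longer fits once the IPIP header is added, which is the headroom problem described in the conversation.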
[10:57:56] netops, Infrastructure-Foundations, SRE: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539#10562858 (ayounsi) Open→Resolved a:ayounsi Nop, thanks for the ping. There is now {T364092}
[11:15:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[11:25:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:20:53] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807 (cmooney) NEW p:Triage→Low
[12:20:59] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10563148 (cmooney)
[12:21:01] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10563149 (cmooney)
[13:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:20:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:36:23] SRE-tools, Infrastructure-Foundations, SRE, Patch-For-Review: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10563373 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I've modified the sre.hardware.upgra...
[14:54:17] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10563685 (ayounsi) Thanks ! >>! In T384731#10556225, @fgiunchedi wrote: > Since we have to overwrite...
[16:00:28] netops, Infrastructure-Foundations, SRE: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10564100 (cmooney)
[16:26:50] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10564274 (cmooney) >>! In T384731#10563685, @ayounsi wrote: >>! In T384731#10556225, @fgiunchedi wrot...
[19:08:39] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10565308 (cmooney) >>! In T384731#10563685, @ayounsi wrote: > Is it possible to duplicate the metric,...