[00:38:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[01:08:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[06:13:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:13:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[08:19:25] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: SwitchCoreInterfaceDown (instance ssw1-f1-codfw:9804) - https://phabricator.wikimedia.org/T404946 (LSobanski) NEW
[10:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:08] ^ expected, install1004 is being shut down, I'll silence it
[11:19:58] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193185 (cmooney) So draining traffic from the node did not go as planned. This config was applied: ` set protocols bgp graceful-shutdown sender set routing-instanc...
[11:42:43] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959 (cmooney) NEW p:Triage→High
[11:43:47] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193246 (BTullis) Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivity on the dse-k8s cluster, which may well ha...
[11:59:20] netops, DC-Ops, Infrastructure-Foundations, SRE: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193277 (cmooney) >>! In T400783#11193246, @BTullis wrote: > Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivit...
[14:55:17] topranks: how difficult is it to set up port mirroring on one of our juniper switches? Supermicro suspects our network is at fault for https://onsite.supermicro.com/index.php?/Tickets/Ticket/View/185807, which I think is highly unlikely. I was thinking a packet capture from the host port would be pretty definitive.
[14:56:48] jhathaway: I've not done it before tbh so I'll need to investigate. Shouldn't be too tricky I guess.
A remote capture might be harder but probably if we get a test host connected to the same switch, mirror to that and do tcpdump on it we can see
[14:56:57] I'm just about to jump on a call I'll investigate after
[14:57:32] no rush at, happy to chat about other options when you have a moment
[14:58:08] s/at/at all/
[15:24:48] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194244 (Papaul) {F66055737}
[15:24:54] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194245 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6fac31b1-92f6-4bf9-bf95-d9862483e9b6) set by cmooney@cumin1003 f...
[15:31:11] FIRING: PfwCoreBGPDown: ...
[15:31:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[15:46:20] RESOLVED: PfwCoreBGPDown: ...
[15:46:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[16:04:09] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194406 (Papaul) out put of todays' troubleshooting Last login: Tue May 20 13:04:15 on ttyu0 --- JUNOS 23.4R2.13 Kernel 64-bit JNPR-12...
[16:12:04] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:16:09] Mail, Infrastructure-Foundations, SRE Observability, Epic: Parse DMARC reports and create a dashboard from data - https://phabricator.wikimedia.org/T404888#11194504 (jhathaway)
[16:33:08] Mail, Infrastructure-Foundations, SRE Observability, Epic: Parse DMARC reports and create a dashboard from data - https://phabricator.wikimedia.org/T404888#11194606 (jhathaway) Observability, Would it be acceptable to store the data from the parsed DMARC reports in OpenSearch? My initial estimat...
[16:41:17] jhathaway: what host is it you might need to do the port mirror for?
[16:41:28] I don't have a super-micro account so I can't actually see the ticket
[16:41:48] sretest2001
[16:42:52] There is a smidge more info in https://phabricator.wikimedia.org/T383173
[16:43:08] happy to get you a supermicro account, or forward the thread to you
[16:44:34] I sent you the *humongous* thread, no need to read it, but if you want to search for anything in it
[16:45:09] I'm open to other ideas, or feel free to say no, just trying to figure out the best way to exonerate our network stack
[16:58:05] yeah it might be the best option
[16:59:18] so.... the download starts, but it stops half way through?
[16:59:26] is the pcap in the phab task? I don't seem to see it
[17:00:12] jhathaway: and this "stopping half way" doesn't happen if the supplied url used an IP address?
[17:03:37] The dhcp request always succeeds, then pxe tries to fetch the HTTP URI, http://apt.wikimedia.org/efiboot-systemrescue/snponly.efi
[17:03:53] on warm boot this works, but on a cold boot it fails
[17:04:14] in the pcaps on the warm boot, we see a HEAD and then a GET for the URL
[17:04:51] on the cold boot, only a GET request is sent, then they appear to overfill a buffer and the request hangs
[17:05:07] if we replace apt.wikimedia.org with its IP, it always works
[17:05:28] from their debug log on a cold boot, the initial HEAD request fails
[17:05:41] but we don't see any evident of that request in the pcap
[17:05:46] *evidence
[17:05:57] so I am not sure it is being sent out from the NIC
[17:06:04] I'll send you the pcaps
[17:06:58] ok yeah definitely does not sound like that could be the network
[17:07:33] I guess they are just trying to blame anything. obviously we do not secretly fix the network when you warm boot :)
[17:07:42] :)
[17:07:50] they overfill which buffer? their tcp send buffer?
[17:09:30] arz.hel took the original pcaps, his comment "For what I see, in a warm boot UEFI does a HTTP HEAD before the GET, maybe to get the full size of the .efi file ([Content length: 207360]) and thus calibrate its buffers for the real download. This HEAD doesn't happen during a cold boot."
[17:10:15] yeah there definitely is a logic to that alright
[17:10:36] on a cold boot the receiver tcp buffer fills up, and the tcp connection stalls
[17:10:57] ok yeah
[17:11:14] that's definitely an application layer issue, it's not pulling the bytes off the tcp receive buffer and the flow stalls once it's full
[17:11:31] but we can do the mirror test probably if it will get them beyond that - I know what support can be like
[17:11:59] looking at this email - it's cool to see that weird on-board RS232 connection haha
[17:12:04] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[17:12:06] yeah, the mystery is why their HEAD request fails, their debug logs didn't provide any more detail, other than that it does fail
[17:12:34] indeed the rs232 dongle feels like some retro debugging
[17:13:41] nice set of instructions on how to use putty too. highly nostalgic :)
[17:13:59] when we have a cold boot but use the IP do we see the HEAD request?
[17:14:14] yes
[17:14:26] so it seems to be some side effect of the dns lookup
[17:15:04] * jhathaway is tempted to run putty in wine for old times' sake
[17:18:39] listen even though it's a long shot I think a pcap might be a good idea
[17:18:57] we can try all of the scenarios again: cold boot + fqdn, cold boot + ip, warm boot + fqdn
[17:19:00] and save pcaps
[17:19:29] from the host side we'll see the DNS and everything else.
We know it's not the network but perhaps we might notice some difference that could help
[17:19:42] i'd not jump to doing it - but you've obviously been at this for a while
[17:19:55] FIRING: MaxConntrack: Max conntrack at 81.64% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:20:24] I think that would be great, but if you think the mirror port effort is too much, I'm happy to close and just use the IP as the workaround
[17:21:07] given how long this ticket has been going on, my confidence is not high on an eventual resolution, but I'm also curious to get to the bottom of the issue
[17:24:55] RESOLVED: MaxConntrack: Max conntrack at 81.83% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:26:03] topranks: I'm stepping away for lunch, have a good evening
[17:26:05] jhathaway: just looking at the Juniper docs it seems fairly straightforward to do
[17:26:17] what we need is to get another sretest host in the same rack
[17:26:49] nod, or we can perhaps hijack a port on another host, running the pcap shouldn't be too invasive
[17:26:50] probably want to connect two interfaces on it - one we will use to reimage it and ssh on, and the other we will configure as the mirror port to run the tcpdump on
[17:27:37] yeah we could use another host alright - I think we want the second port for the mirror to make sure it's clean (not sure if we can do anything else with how it works on the switch side tbh)
[17:28:21] I love Juniper's example "spy on your workers" config:
[17:28:26] set analyzer employee-monitor output interface xe-0/0/47.0
[17:28:31] ha!
[17:28:33] amazing
[17:30:09] there are a few ganeti hosts in that same rack.
perhaps we could drain one of VMs to be sure and then use it?
[17:30:19] anyway get your lunch we can chat tomorrow about it
[17:30:29] sounds good, thanks
[17:55:38] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11195081 (RobH)
[18:06:02] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11195144 (RobH)
[18:36:45] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11195295 (Papaul) update from Juniper after our phone call today. ` Hello Teams, Thank you for your time on our call. During our call w...
[19:52:30] netbox, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11195545 (cmooney) > It seems to be a part of the router's design. Can't believe they did all that, made us drain it and reboot and even r...