[05:26:55] FIRING: MaxConntrack: Max conntrack at 82.5% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[05:31:55] RESOLVED: MaxConntrack: Max conntrack at 85.41% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[05:46:04] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[06:46:04] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[07:52:55] FIRING: MaxConntrack: Max conntrack at 80.48% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[07:57:55] RESOLVED: MaxConntrack: Max conntrack at 80.48% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:56:52] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: SwitchCoreInterfaceDown (instance ssw1-f1-codfw:9804) - https://phabricator.wikimedia.org/T404946#11201792 (cmooney) Open→Resolved a:cmooney All back up now.
[15:01:51] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11201828 (ssingh) Thanks for the discovery and writing this up @Cmooney! No concerns from Traffic since as you mentioned it is the...
[15:04:01] CAS-SSO, WMF-General-or-Unknown, Epic, Security: All Wikimedia developer services should use single sign-on - https://phabricator.wikimedia.org/T189531#11201842 (Arendpieter) >>! In T189531#11073700, @Aklapper wrote: > yes Thank you! For Striker I pushed some code here: https://gerrit.wikimedia....
[15:17:52] netops, fundraising-tech-ops, Infrastructure-Foundations: Move pfw1b-codfw to rack F5 - https://phabricator.wikimedia.org/T401297#11201962 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b0c0458-499c-4287-8c6f-8f66dccdba91) set by pt1979@cumin2002 for 2:00:00 on 1 host(s) and their...
[15:19:11] netops, fundraising-tech-ops, Infrastructure-Foundations: Move pfw1b-codfw to rack F5 - https://phabricator.wikimedia.org/T401297#11201968 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=469fffa6-5667-4b97-b402-ebd2aefae808) set by pt1979@cumin2002 for 2:00:00 on 2 host(s) and their...
[16:06:21] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11202340 (cmooney) @Jclark-ctr when you have some time can we have a look at this one? No particular rush thanks.
[16:08:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:01] topranks: supermicro had a similar suggestion on mirroring the port, though they mentioned capturing a pcap on the juniper switch directly, is that an option we should consider? I forwarded you their email.
[16:23:26] jhathaway: hey... I'm actually just testing it out now
[16:23:39] capturing directly on the switch is probably not ideal, might be an option in this case where traffic is light
[16:23:51] but anyway we have the second link up, I'm just playing with it now
[16:24:02] cool, let me know if I can help
[16:24:32] it's "sort of" working, but seems to only be mirroring traffic that is going to the Juniper CPU.... i.e. traffic that is handled by the switch ASIC does not seem to be getting captured
[16:24:36] leave it with me a few mins
[16:25:06] sounds good
[16:59:07] oh ffs
[16:59:48] jhathaway: been banging my head against this thing for over an hour
[17:00:08] turns out my "muscle memory" -p flag on the tcpdump was causing the mirrored packets not to show on the host
[17:00:18] ugh, that sucks sorry
[17:00:46] haha nah it's a good lesson for me
[17:01:12] because I could see some stuff - LLDP, router advertisements, I incorrectly assumed the switch was only sending me traffic it could see
[17:01:19] anyway all good, let me confirm we are ok
[17:01:53] that mode always sounds more illicit to me than it is ;)
[17:04:19] haha indeed :P
[17:04:50] I got used to setting it as on some NIC drivers it will reset the port when the mode is switched
[17:04:55] anyway not a worry here
[17:05:11] the mirror is in place and working, so when you have time to run a test we can do it and capture stuff
[17:05:50] i'm ready now, if you have time
[17:06:08] cool, gimme 5 mins want to grab some tea
[17:06:16] sounds good
[17:14:24] ok let's give it a shot I guess?
[17:14:42] there is a chance we won't see the DHCP traffic, not 100%, but let's see how it goes
[17:15:05] I guess we should try all 3 scenarios and capture them so at least we've a record?
[17:15:22] i.e. warm/cold boot with standard url, plus the same with IP as the URL?
[17:23:17] topranks: sounds good
[17:24:03] jhathaway: let me know which you want to do first so we name the files right
[17:24:20] or if you want to do it yourself, this is the command I was gonna run on ganeti2033:
[17:24:36] sudo tcpdump -s0 -w /tmp/ -i eno12409np1
[17:25:09] thanks, I'll just run myself, probably easier, anything I should do once I'm done?
[17:26:23] not really no
[17:26:50] well we need to reset that port, I think "ip link set dev eno12409np1 down" should do it
[17:26:57] but you can ping me I'll tidy up the switch side too
[17:27:03] sounds good, thanks
[17:27:15] I'd scp the file off after the first one and have a look verify it's working too
[17:27:24] ping me then I'm interested to have a look also
[17:27:24] good idea, will do
[17:51:54] topranks: all done, captured 4 scenarios, hostname-{warm,cold}, ip-{warm,cold}
[17:52:06] only hostname-cold, hangs
[17:56:09] thanks again, stepping away for lunch
[18:02:07] ok thanks jesse
[18:02:17] yeah having a quick look I think our conclusion from before still applies
[18:03:56] https://usercontent.irccloud-cdn.com/file/1fvGpkHE/image.png
[18:04:23] The receive window reported by the host just goes down and down from the initial 65535, until it hits zero and our apt server stops sending
[18:04:37] at least we have the captures including DHCP, DNS etc now we can show them
[18:05:30] suggests the HTTP client implementation is not pulling the bytes off the receive TCP buffer on the host or something like that
[18:06:35] we don't see the HEAD request, so as we were saying that must be something to do with it
[18:10:06] Just seen that email they sent.... doesn't make sense to me. I'd bat it back to them and say "surely if the switch drops the HTTP packet the host will re-send it when the HTTP server fails to send a TCP ACK for it?"
[18:11:27] one would also be sceptical that the switch would somehow do this always when the FQDN is used but never if the IP, which it won't be looking at
[18:12:27] also even if the HEAD request was dropped we'd see the TCP handshake for that flow, but all we see is a single TCP conversation, the one used for the GET
[19:10:35] thanks top.ranks for the analysis and help
[19:51:19] do we log switch port link states anywhere, just wanted to confirm the link doesn't flap
[20:25:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[20:50:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
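
Note on the 18:04 observation (receive window shrinking from 65535 to zero): a rough way to confirm this from a capture without opening a GUI is sketched below. It is only an illustration, not what was run in the channel. It assumes a classic pcap file (the tcpdump default, not pcapng), untagged Ethernet frames and IPv4; the function names and the client address are made up for the example. The value read is the raw 16-bit window field, so any window scaling negotiated in the SYN is not applied. RST segments are skipped because they commonly carry a zero window.

```python
import struct


def tcp_windows(pcap_bytes, src_ip):
    """Yield (timestamp, raw_window) for IPv4/TCP segments sent by src_ip."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    # Link type lives in the last 4 bytes of the 24-byte global header.
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures (linktype 1) are handled")
    wanted = bytes(int(part) for part in src_ip.split("."))
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # Skip anything that is not plain IPv4 (VLAN tags, IPv6, ARP, LLDP).
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ip = frame[14:]
        header_len = (ip[0] & 0x0F) * 4
        if ip[9] != 6 or ip[12:16] != wanted:  # protocol 6 is TCP
            continue
        tcp = ip[header_len:]
        if len(tcp) < 20:
            continue
        if tcp[13] & 0x04:  # RST segments usually advertise window 0
            continue
        window = struct.unpack("!H", tcp[14:16])[0]
        yield ts_sec + ts_usec / 1e6, window


def first_zero_window(pcap_bytes, src_ip):
    """Return the timestamp of the first zero-window segment, or None."""
    for timestamp, window in tcp_windows(pcap_bytes, src_ip):
        if window == 0:
            return timestamp
    return None
```

Usage would be along the lines of `first_zero_window(open(path, "rb").read(), "10.0.0.1")`, where `path` points at one of the captured files and the address is that of the booting host (both placeholders here).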