[03:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:05] elukey: thanks for those checks on the memcached stuff, I 100% agree those stats do look like an issue and could well be the cause of the timeouts
[09:51:31] and yes, accept() failing is different from EAGAIN: it's failing to create the socket, which you can easily imagine leading to a timeout error on the client side
[09:57:57] topranks: o/ yeah it is very worrisome, but I don't see a clear sign of distress in the config. I expected something ulimit-related, but checking the current memcached process limits, the cap seems to be around 25k and the sockets open are around 5k (but maybe I am missing something)
[09:58:36] do you have any ideas?
[09:58:38] hmm yeah that's the bit I don't understand either (if it is a constraint, what the constraint is)
[09:59:00] I was looking at some of the mc hosts to see about load or cpu pressure, but they all seem relatively stable
[09:59:26] they are on 10G NICs mostly and doing maybe 1G of traffic, so the pipeline from network card to OS should not be locking up; cpu seems ok
[09:59:55] being a selfish netops, of course, if I can prove it's the hosts and not the network I will just ride off into the sunset :P
[10:00:36] but honestly I don't think it's the network; I can't see any signs of a problem, plus I'd expect us to see problems in lots of places if it was
[10:03:58] (meeting brb sorry)
[10:04:59] just checked via /proc, and the file descriptor numbers are like you said: maybe 5k in use on the two hosts I checked, and the limit is 25k, so it doesn't seem like that
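(For reference: the ulimit check described above can be reproduced directly from /proc. Below is a minimal sketch in C, assuming the target PID is passed on the command line; the program and its output format are illustrative, not the actual tooling used in this investigation.)

/* Compare a process's open file descriptors against its
 * RLIMIT_NOFILE soft limit, both read from /proc/<pid>/. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[256], line[256];
    long limit = -1, nfds = 0;

    /* The "Max open files" row of /proc/<pid>/limits holds the
     * soft limit as the first number after the name. */
    snprintf(path, sizeof(path), "/proc/%s/limits", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen limits"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Max open files", 14) == 0)
            sscanf(line + 14, "%ld", &limit);
    }
    fclose(f);

    /* Each entry in /proc/<pid>/fd is one open descriptor. */
    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
    DIR *d = opendir(path);
    if (!d) { perror("opendir fd"); return 1; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')
            nfds++;
    closedir(d);

    printf("fds in use: %ld, soft limit: %ld\n", nfds, limit);
    return 0;
}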
[10:32:39] back!
[10:45:33] topranks: the other thing is that accept() returning EAGAIN may be benign, i.e. just normal non-blocking behaviour, like read()'s. Looks weird, but it may make sense given what we are seeing
[10:45:47] same for the failed SSL handshakes; there are not a huge number of them every day
[10:50:56] elukey: yeah perhaps, I'm not familiar with exactly how it operates
[10:51:01] yeah see https://github.com/memcached/memcached/blob/1.6.18/memcached.c#L3000
[10:51:02] and it's been there for a while
[10:52:51] not an expert at reading that... but it's not logging the "too many open connections" message, which I guess means it's not that
[10:53:28] topranks: I think it logs the error before checking if it is EAGAIN
[10:53:56] so in our case, it may indicate that it is benign
[10:54:15] I only see "accept4(): Resource temporarily unavailable" in the logs
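(For context: the pattern being discussed is the standard non-blocking accept loop; the sketch below is modelled loosely on the linked memcached code rather than copied from it. On a non-blocking listen socket, accept4() returning -1 with errno set to EAGAIN just means the accept queue is empty and is benign; notably, glibc's strerror(EAGAIN) is exactly "Resource temporarily unavailable", matching the log line quoted above.)

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Called when the event loop reports the listen fd readable. */
void drain_accept_queue(int listen_fd)
{
    for (;;) {
        int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
        if (fd >= 0) {
            /* hand the new connection off to a worker here */
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;              /* queue drained: benign, not an error */
        if (errno == EMFILE || errno == ENFILE) {
            perror("accept4(): out of file descriptors");
            break;              /* the case a ulimit would cause */
        }
        perror("accept4()");    /* some other genuine failure */
        break;
    }
}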
[10:58:21] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726917 (fgiunchedi) FWIW I found some prior art / ideas here {T367592}
[11:10:28] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726958 (jijiki)
[11:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:42:28] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727260 (brouberol) {F73147012} We can see that sockets are no longer leaking after the NIC replace...
[12:42:39] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727262 (brouberol) Open→Resolved a: brouberol
[12:43:21] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727266 (brouberol) a: brouberol→BTullis
[14:44:29] netops, Infrastructure-Foundations, Traffic: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11727945 (ayounsi) Open→Resolved Preferred path changed as expected: `name=esams 185.71.138.138/32 *[BGP/170] 00:00:03, MED 0, localpre...
[15:27:09] netops, Infrastructure-Foundations, Traffic: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11728162 (cmooney) Nice work!
[15:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:49] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial - https://phabricator.wikimedia.org/T420623 (RobH) NEW
[18:10:16] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729203 (RobH)
[18:50:29] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729465 (Jclark-ctr) a: Jclark-ctr
[20:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:31] Mail, Infrastructure-Foundations, MediaWiki-Email, MediaWiki-extensions-EmailAuth, and 4 others: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047#11729810 (Krinkle) The volume of this error has doubled after Jan 26-28 ([Logstas...
[21:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:21:56] RESOLVED: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:23:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:53:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
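(Aside on the MaxConntrack alerts above: the check presumably compares the kernel's current conntrack entry count against the table maximum, both exposed under /proc/sys/net/netfilter/. A minimal sketch of that comparison, with an illustrative 90% threshold rather than the alert's actual configured one:)

#include <stdio.h>

/* Read a single long from a /proc/sys file, or -1 on failure. */
static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long count = read_long("/proc/sys/net/netfilter/nf_conntrack_count");
    long max   = read_long("/proc/sys/net/netfilter/nf_conntrack_max");
    if (count < 0 || max <= 0) {
        fprintf(stderr, "conntrack sysctls not readable\n");
        return 2;
    }
    double pct = 100.0 * count / max;
    printf("conntrack: %ld / %ld (%.1f%%)\n", count, max, pct);
    return pct > 90.0 ? 1 : 0;   /* illustrative threshold */
}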