[03:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:05] elukey: thanks for those checks on the memcached stuff, I 100% agree those stats do look like an issue and could well be the cause of the timeouts
[09:51:31] and yes, accept() failing is different from EAGAIN: it's failing to create the socket, which you can easily imagine leading to a timeout error on the client side
[09:57:57] topranks: o/ yeah it is very worrisome, but I don't see a clear sign of distress in the config. I expected something ulimit-related, but checking the current memcached process limits, the cap seems to be around 25k and the sockets open are around 5k (but maybe I am missing something)
[09:58:36] do you have any ideas?
[09:58:38] hmm yeah that's the bit I don't understand either (if it is a constraint, what the constraint is)
[09:59:00] I was looking at some of the mc hosts to see about load or cpu pressure, but they all seem relatively stable
[09:59:26] they are on 10G NICs mostly and doing maybe 1G of traffic, so the pipeline from network card to OS should not be locking up; cpu seems ok
[09:59:55] being a selfish netops, of course, if I can prove it's the hosts and not the network I will just ride off into the sunset :P
[10:00:36] but honestly I don't think it's the network; I can't see any signs of a problem, plus I'd expect us to see problems in lots of places if it was
[10:03:58] (meeting brb sorry)
[10:04:59] just checked via /proc, and the file descriptor numbers are like you said: maybe 5k in use on the two hosts I checked, and the limit is 25k, so it doesn't seem like that
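(For reference: the ulimit check described above can be reproduced directly from /proc. Below is a minimal sketch in C, assuming the target PID is passed on the command line; the program and its output format are illustrative, not the actual tooling used in this investigation.)

/* Compare a process's open file descriptors against its
 * RLIMIT_NOFILE soft limit, both read from /proc/<pid>/. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[256], line[256];
    long limit = -1, nfds = 0;

    /* The "Max open files" row of /proc/<pid>/limits holds the
     * soft limit as the first number after the name. */
    snprintf(path, sizeof(path), "/proc/%s/limits", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen limits"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Max open files", 14) == 0)
            sscanf(line + 14, "%ld", &limit);
    }
    fclose(f);

    /* Each entry in /proc/<pid>/fd is one open descriptor. */
    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
    DIR *d = opendir(path);
    if (!d) { perror("opendir fd"); return 1; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')
            nfds++;
    closedir(d);

    printf("fds in use: %ld, soft limit: %ld\n", nfds, limit);
    return 0;
}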
[10:32:39] back!
[10:45:33] topranks: the other thing is that accept() returning EAGAIN may be benign, i.e. just normal non-blocking behaviour, like read()'s. Looks weird, but it may make sense given what we are seeing
[10:45:47] same for the failed SSL handshakes; there are not a huge number of them every day
[10:50:56] elukey: yeah perhaps, I'm not familiar with exactly how it operates
[10:51:01] yeah see https://github.com/memcached/memcached/blob/1.6.18/memcached.c#L3000
[10:51:02] and it's been there for a while
[10:52:51] not an expert at reading that... but it's not logging the "too many open connections" message, which I guess means it's not that
[10:53:28] topranks: I think it logs the error before checking if it is EAGAIN
[10:53:56] so in our case, it may indicate that it is benign
[10:54:15] I only see "accept4(): Resource temporarily unavailable" in the logs
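(For context: the pattern being discussed is the standard non-blocking accept loop; the sketch below is modelled loosely on the linked memcached code rather than copied from it. On a non-blocking listen socket, accept4() returning -1 with errno set to EAGAIN just means the accept queue is empty and is benign; notably, glibc's strerror(EAGAIN) is exactly "Resource temporarily unavailable", matching the log line quoted above.)

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Called when the event loop reports the listen fd readable. */
void drain_accept_queue(int listen_fd)
{
    for (;;) {
        int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
        if (fd >= 0) {
            /* hand the new connection off to a worker here */
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;              /* queue drained: benign, not an error */
        if (errno == EMFILE || errno == ENFILE) {
            perror("accept4(): out of file descriptors");
            break;              /* the case a ulimit would cause */
        }
        perror("accept4()");    /* some other genuine failure */
        break;
    }
}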
[10:58:21] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726917 (fgiunchedi) FWIW I found some prior art / ideas here {T367592}
[11:10:28] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726958 (jijiki)
[11:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:42:28] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727260 (brouberol) {F73147012} We can see that sockets are no longer leaking after the NIC replace...
[12:42:39] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727262 (brouberol) Open→Resolved a: brouberol
[12:43:21] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727266 (brouberol) a: brouberol→BTullis
[14:44:29] netops, Infrastructure-Foundations, Traffic: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11727945 (ayounsi) Open→Resolved Preferred path changed as expected: `name=esams 185.71.138.138/32 *[BGP/170] 00:00:03, MED 0, localpre...
[15:27:09] netops, Infrastructure-Foundations, Traffic: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized - https://phabricator.wikimedia.org/T420342#11728162 (cmooney) Nice work!
[15:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:49] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial - https://phabricator.wikimedia.org/T420623 (RobH) NEW
[18:10:16] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729203 (RobH)
[18:50:29] netbox, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729465 (Jclark-ctr) a: Jclark-ctr
[20:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:31] Mail, Infrastructure-Foundations, MediaWiki-Email, MediaWiki-extensions-EmailAuth, and 4 others: Could not send confirmation email: Unknown error in PHP's mail() function. - https://phabricator.wikimedia.org/T383047#11729810 (Krinkle) The volume of this error has doubled after Jan 26-28 ([Logstas...
[21:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:06:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:21:56] RESOLVED: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:23:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:53:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
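(Aside on the MaxConntrack alerts above: the check presumably compares the kernel's current conntrack entry count against the table maximum, both exposed under /proc/sys/net/netfilter/. A minimal sketch of that comparison, with an illustrative 90% threshold rather than the alert's actual configured one:)

#include <stdio.h>

/* Read a single long from a /proc/sys file, or -1 on failure. */
static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long count = read_long("/proc/sys/net/netfilter/nf_conntrack_count");
    long max   = read_long("/proc/sys/net/netfilter/nf_conntrack_max");
    if (count < 0 || max <= 0) {
        fprintf(stderr, "conntrack sysctls not readable\n");
        return 2;
    }
    double pct = 100.0 * count / max;
    printf("conntrack: %ld / %ld (%.1f%%)\n", count, max, pct);
    return pct > 90.0 ? 1 : 0;   /* illustrative threshold */
}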