[00:37:40] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on cp7011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:23:50] Depooling cp7011 - that's odd
[01:32:08] not super thrilled about that issue - it looks to have been some sort of race condition - a reboot fixed it and the output of ifup@eno12399np0.service is as expected
[01:32:28] Previously it failed on boot with:
[01:32:29] > Error: ipv6: address already assigned.
[01:32:34] > ifup: failed to bring up eno12399np0
[01:33:12] When I examined the interfaces they seemed correct, though. So yeah, I suspect a race condition somewhere. If it happens again I think we can action it but for now we'll chalk it up to space radiation
[01:40:25] RESOLVED: [3x] SystemdUnitFailed: haproxy.service on cp7011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:46] slyngs: o/ when you are around ok if we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283821 ?
[07:21:12] elukey: Go ahead :-)
[07:22:06] I have some cp-host reboots today, but I can wait for your patch to roll out and settle
[07:26:32] slyngs: ack! Ok for me to disable puppet on all cps and then roll out incrementally?
[07:26:53] Absolutely, let me know if you need anything from me
[07:27:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:27:31] all right proceeding!
[07:32:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh6002:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=drmrs&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[07:35:50] slyngs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289754 my soul is sad and shattered
[07:36:14] didn't see any of that in pcc
[07:38:12] I'm a little disappointed that pcc didn't spot that
[07:38:15] +1
[07:39:25] I think that puppet hates me after all this time
[07:41:26] all right on cp3066 I see
[07:41:27] Notice: /Stage[main]/Profile::Cache::Haproxy/File[/etc/haproxy/ip-reputation.d/]/ensure: created
[07:41:30] that is perfect
[07:41:35] going to re-enable puppet
[07:42:41] slyngs: any preference on what host to use to flip the hiera flag on?
[07:42:48] or should I pick one at random?
[07:42:56] I like testing on magru :-)
[07:43:06] But chef's choice
[07:44:21] Vg did call me crazy or something like that when I tested on esams during EU daytime, so maybe not esams and drmrs
[07:50:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289808
[07:53:24] +1
[07:59:10] let's keep an eye on magru TTFB or other metrics, to check if this doesn't slow down requests for impacted hosts
[08:07:33] cp7002 for some reason started out a little slower than the rest, so we need to compare it to itself
[08:20:32] thanks for the help folks!
[08:22:04] Anytime. Could you let me know when Puppet is re-enabled?
[08:41:43] it is already re-enabled
[08:41:48] (sorry I didn't see the msg)
[09:16:30] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939628 (MoritzMuehlenhoff)
[09:23:33] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11939641 (ayounsi)
[09:23:55] Traffic, DC-Ops, decommission-hardware, ops-codfw, SRE: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828 (Fabfur) NEW
[11:59:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp7008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp7008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:53] Traffic, Beta-Cluster-Infrastructure: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep - https://phabricator.wikimedia.org/T426822#11940439 (ssingh) ` Error: Failed to apply catalog: Parameter source failed on File[/etc/haproxy/ip-reputation.d/top_10000_ips_reque...
[12:36:17] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review: k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11940457 (JMeybohm) I probably don't have all the deta...
[12:52:22] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11940555 (JMeybohm)
[12:58:22] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp7016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:22] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp7016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:10] Traffic, SRE, Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941096 (Cuthead) Open→Resolved a:Cuthead
[14:26:40] Traffic, SRE, Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941098 (Cuthead) Thanks everyone.
[14:27:53] Traffic, SRE, Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11941105 (SLyngshede-WMF) Resolved→Open p:Triage→Medium @Cuthead Sorry, I'll just reopen this. I still need to do the second half of the caching servers, t...
[14:44:01] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941232 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b79d6725-839f-4b80-8718-5cb7000c8fbf) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[15:12:12] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint lily-of-the-valley (May 4 - May 22)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11941368 (Dreamy_Jazz) a:Dreamy_Jazz I'll take a stab at updating...
[15:33:27] Traffic, DC-Ops, decommission-hardware, ops-codfw, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11941439 (BCornwall)
[15:46:47] Traffic, DC-Ops, decommission-hardware, ops-codfw, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11941515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `cp[2041-2042].codfw.wmnet` - cp2041.codfw.wmnet (**...
[15:53:20] FIRING: DnsboxServiceMismatch: Service ntp-b state mismatch on dns6002:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=drmrs&var-instance=dns6002:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[15:53:28] ^ restart
[15:58:20] RESOLVED: DnsboxServiceMismatch: Service ntp-b state mismatch on dns6002:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=drmrs&var-instance=dns6002:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[16:08:06] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint lily-of-the-valley (May 4 - May 22)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11941633 (Dreamy_Jazz) Putting in 'Needs review' so that others can r...
[16:08:23] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint lily-of-the-valley (May 4 - May 22)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11941634 (Dreamy_Jazz)
[16:10:39] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941641 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=57a4ad63-f533-4335-a960-7d2139446ca8) set by pt1979@cumin1003 for 2:00:00 on 3 host(s) and their services with reason: s...
[16:36:46] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941779 (Papaul)
[16:48:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp4045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:53:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp4045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:05:55] looks like this is part of a larger issue in eqsin
[17:06:03] buckle up
[17:10:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[17:15:01] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941945 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fb6f5879-0c86-4cca-be43-4b2cb4494d10) set by pt1979@cumin1003 for 2:00:00 on 4 host(s) and their services with reason: s...
[17:23:04] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11941963 (Papaul)
[17:28:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp4040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:33:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp4040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:52] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942050 (ssingh) Yeah I should have been more careful in resolving this, my bad. @Jhancock.wm: While the DIMM was replaced, we still need to look at the RAID thing.
[17:50:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:04:55] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942089 (Papaul)
[18:05:28] netops, Infrastructure-Foundations: ulsfo: upgrade routers (2026) - https://phabricator.wikimedia.org/T416562#11942095 (Papaul) Open→Resolved All 3 routers are now up to date.
[18:09:25] FIRING: [2x] SystemdUnitFailed: haproxy.service on cp4041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:10:40] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[18:14:25] RESOLVED: [2x] SystemdUnitFailed: haproxy.service on cp4041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:39] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942148 (ssingh) a:ssingh→Jhancock.wm
[18:37:19] Traffic, Sustainability (Incident Followup): Ensure the pre-repooling checklist includes to restart liberica services whenever realserver IPs has changed - https://phabricator.wikimedia.org/T426299#11942198 (ssingh) p:Triage→Low a:BCornwall
[18:38:40] Traffic, SRE: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#11942203 (ssingh) Without knowing the details of this, I wanted to point out that the drmrs refresh is upcoming in Q1/Q2 of FY2026 and drmrs like all edge sites, is on Liberica. If there is any rede...
[18:59:20] FIRING: DnsboxServiceMismatch: Service ntp-a state mismatch on dns3003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=esams&var-instance=dns3003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[19:04:20] RESOLVED: DnsboxServiceMismatch: Service ntp-a state mismatch on dns3003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=esams&var-instance=dns3003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[19:33:39] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942438 (Jhancock.wm) I put in a different drive from a different manufacturer. this new one should be the same as the old one. lemme know if that works. you might need to manually add it to the...
[19:35:26] brett: ^ let's try now I guess
[19:36:29] ack
[19:45:20] Traffic, SEO: Bing can't search images from Commons, is Wikimedia denying their requests? - https://phabricator.wikimedia.org/T425850#11942471 (ssingh) Can someone confirm on how to reproduce this? If I try to go to Bing and do a reverse search with Commons, it seems to work for me. > Here are two IPs t...
[19:46:46] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942475 (BCornwall) @Jhancock.wm Thanks for doing that - however, I feel that there might be some sort of firmware thing going on. Upon reboot I'm seeing this: {F82982529} I'm still seeing th...
[19:49:41] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942488 (Jhancock.wm) i'll pull the server out and take a look. could be the card or a cable to it.
[19:53:20] FIRING: DnsboxServiceMismatch: Service ntp-a state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=ulsfo&var-instance=dns4003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[19:55:07] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942505 (BCornwall) Actually, I just created a virtual disk in the storage section of idrac and it seems to be attached now. Is that the appropriate way forward with new disks or have I stumbled...
[19:58:20] RESOLVED: DnsboxServiceMismatch: Service ntp-a state mismatch on dns4003:9100 - https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch - https://grafana.wikimedia.org/d/96fb573c-0f3c-456a-886c-e50c29f3ed48/dns-box-service-state?var-site=ulsfo&var-instance=dns4003:9100 - https://alerts.wikimedia.org/?q=alertname%3DDnsboxServiceMismatch
[21:11:35] Traffic, Beta-Cluster-Infrastructure, Patch-For-Review: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep - https://phabricator.wikimedia.org/T426822#11942741 (bd808) p:Triage→High a:ssingh This is currently blocking an unblock request for Beta Clu...
[21:17:06] Traffic, Beta-Cluster-Infrastructure, Patch-For-Review: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep - https://phabricator.wikimedia.org/T426822#11942764 (bd808) >>! In T426822#11942742, @bd808 wrote: > This is currently blocking an unblock request for Be...
[21:21:07] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942792 (BCornwall) After creating the virtual disk, re-provisioning (for good measure, though no changes were made), then re-imaging, we're back in business. We might follow-up regarding the us...
[21:21:29] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11942795 (BCornwall) Open→Resolved
[21:41:20] Traffic, DC-Ops, ops-codfw: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912 (BCornwall) NEW
[21:54:48] Traffic, DC-Ops, ops-codfw: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11942911 (BCornwall) Chatted with @Papaul and I'm told that it's a requirement for us to set each drive as their own virtual disk in RAID0 for the drives to be accessible/online....
[23:02:55] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11943144 (Papaul) I sent a follow up email on this and Engineer said he will get back with me