[01:30:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp1115:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1115&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[01:35:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp1115:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1115&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[03:25:48] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246 (Papaul) NEW
[03:42:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp5025:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5025&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[03:47:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp5025:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5025&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[04:34:16] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953658 (Papaul)
[04:47:42] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953661 (Papaul)
[04:49:42] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11953662 (Papaul) Open→Resolved Both switches are now set to offline. The only step left is for onsite to remove all the cable...
[04:51:51] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11953665 (Papaul) Email back from Nokia team ` The target release is still being considered. I’ll let you know once we have more information. `
[06:37:50] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26): k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11953740 (BWojtowicz-WMF) Thanks to the changes to LVS, I was successful wit...
[06:51:35] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26): k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11953749 (elukey) @BWojtowicz-WMF @DPogorzelski-WMF did you see what Janis s...
[07:05:14] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26): k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11953773 (BWojtowicz-WMF) @elukey Thanks, I indeed missed it! Initially I t...
[07:29:01] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26): k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11953810 (elukey) @BWojtowicz-WMF I think that it is really awesome, we'll a...
[07:49:11] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: bird bfd session with 172.20.1.1 down - Bad packet from 172.20.1.1 - unknown session id - https://phabricator.wikimedia.org/T427202#11953849 (cmooney) Yeah not really sure what happened there @fgiunchedi, a sync issue with the se...
[07:51:10] Traffic, DC-Ops, decommission-hardware, ops-codfw, and 2 others: Decommision hosts cp2041 - cp2042 - https://phabricator.wikimedia.org/T426828#11953855 (Fabfur) Thanks!
[07:55:21] Traffic: Provide better error pages for HAProxy - https://phabricator.wikimedia.org/T352291#11953859 (Fabfur) I always assumed the ones in **/etc/haproxy/errors/** but they are unused in our current configuration. If we want to actually use them to mimic eventual varnish error pages we should re-introduce th...
[08:32:24] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: bird bfd session with 172.20.1.1 down - Bad packet from 172.20.1.1 - unknown session id - https://phabricator.wikimedia.org/T427202#11953989 (fgiunchedi) Thank you for the detailed explanation @cmooney, definitely TIL things abou...
[08:59:33] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: bird bfd session with 172.20.1.1 down - Bad packet from 172.20.1.1 - unknown session id - https://phabricator.wikimedia.org/T427202#11954086 (cmooney) >>! In T427202#11953989, @fgiunchedi wrote: > Thank you for the detailed expla...
[09:07:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp1104:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1104&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:17:30] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp1104:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1104&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:35:00] hey folks
[09:35:48] I am using the sre.loadbalancer.migrate-service-ipip to migrate the aux-codfw master (k8s) to maglev, I just realized now that the cookbook ran the lvs (I thought it was a different step, my bad)
[09:36:12] I am going to let it restart pybal on lvs2014 etc..
[09:37:04] I also see fabfur rebooting hosts, I chose a perfect timing :D fabfur ok to proceed?
[09:37:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp2043:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2043&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:37:52] I'll add some infos to https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/IPIP to sync in here before running the cookbook
[09:38:02] anyway, I need to proceed otherwise we'll get alerts
[09:38:05] apologies
[09:38:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2044&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:39:52] hey, I'm restarting liberica hosts but I've finished
[09:40:08] now I've depooled cp2043 && cp2044 to test haproxy-awslc as I did for magru
[09:40:18] all done sorry for the extra pings
[09:40:22] I'll do eqiad this afternoon
[09:40:25] no prob! thanks!
[09:42:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp2043:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2043&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:43:07] fixed https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/IPIP
[09:43:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2044&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[09:52:00] proceeding also with eqiad, all good and I merged the patch for the eqiad endpoint as well
[09:57:22] Traffic, Infrastructure-Foundations, Liberica, Prod-Kubernetes, and 2 others: Migrate AUX k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420439#11954369 (elukey) Kubemaster svc in eqiad/codfw moved to Maglev, next step is to do the workers.
[09:59:38] At this point I'll also proceed with the other service, seems really well handled by the cookbook
[10:16:58] Traffic, Infrastructure-Foundations, Liberica, Prod-Kubernetes, and 2 others: Migrate AUX k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420439#11954405 (elukey) Open→Resolved a:elukey All done!
[10:18:10] Traffic, Liberica, Machine-Learning-Team, Prod-Kubernetes, Kubernetes: Migrate ML k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420438#11954409 (elukey) Ping :)
[10:52:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp5029:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5029&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:57:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp5029:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5029&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:22:38] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11954730 (FCeratto-WMF) `es2042` and `es2041` in section `es4` have been switched: `es2041` is now a replica and can be depooled
[11:56:44] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11954962 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=52314e9c-92e4-4ce8-aff3-713ec1b15d3f) set by jynus@cumin1003 for 6:00...
[11:57:41] slyngs, fabfur, we're about to start codfw rack A2 maintenance but lvs2011 needs to be depooled, can one of you assist with it?
[12:03:30] back now from lunch let me check quickly
[12:03:43] <3
[12:04:27] fabfur: Is that pybal or Liberica?
[12:04:30] pybal
[12:04:42] Ah, then I'm less sure about the operation
[12:06:20] I'll downtime, disable puppet and stop pybal on 2011, it should switch over 2014
[12:08:10] cool, thx!
[12:09:38] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955004 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=88812e40-edf3-45b2-b6f9-ae1f746a9dee) set by fabfur@cumin1003 for 2:0...
[12:11:56] puppet disabled, let me stop pybal
[12:13:57] pybal stopped I see connections migrating to lvs2014
[12:15:24] XioNoX: let's wait a bit for traffic to be completely over
[12:15:37] I've set a downtime of 2 hours, should it be sufficient?
[12:16:00] fabfur: yep
[12:16:02] thanks!
[12:20:43] I think we're good to go
[12:24:23] XioNoX: let me know when the activity is done
[12:24:29] will do!
[12:25:14] I see also co2043 should be depooled
[12:25:19] *cp2043
[12:25:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp5027:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5027&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:25:35] I'll do
[12:26:14] {{done}}
[12:28:49] thanks, I was going to do it, but that helps
[12:29:19] no worry, btw I was working on cp2043 previously to test for haproxy-awslc so I was ready :D
[12:30:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp5027:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5027&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:31:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp5027:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5027&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[12:45:44] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp5027:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5027&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[13:02:13] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11955114 (MatthewVernon)
[13:10:02] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint Iris (May 25 - Jun 12)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11955153 (Raine) LGTM for the PSI parts. Thank you! @ssingh do you want to claim...
[13:28:12] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955234 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0ff82d6d-6a46-4d3b-b727-57ef8402c512) set by ayounsi@cumin1003 for 2:...
[13:29:30] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955244 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c14853fb-268e-4348-b4c0-d1f48c81fb76) set by ayounsi@cumin1003 for 2:...
[13:31:42] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955251 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by ayounsi@cumin1003 depool for host wikikube-ctrl2003.codfw.w...
[13:34:34] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955252 (ops-monitoring-bot) Completed depooling of db2196 by ayounsi@cumin1003: switch maintenance
[13:35:21] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955254 (ops-monitoring-bot) Completed depooling of db2221 by ayounsi@cumin1003: switch maintenance
[13:35:59] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955255 (ops-monitoring-bot) Completed depooling of db2222 by ayounsi@cumin1003: switch maintenance
[13:36:31] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955256 (ops-monitoring-bot) Completed depooling of db2223 by ayounsi@cumin1003: switch maintenance
[13:45:58] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint Iris (May 25 - Jun 12)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11955273 (Dreamy_Jazz) Moving to done on PSI board
[13:46:08] Traffic, Bot detection and mitigation (WE4.10 hCaptcha), Documentation, Product Safety and Integrity (Sprint Iris (May 25 - Jun 12)): hcaptcha proxy: update wikitech page - https://phabricator.wikimedia.org/T411131#11955274 (Dreamy_Jazz) In progress→Open
[13:47:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:52:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:53:17] fabfur: A2 maintenance is over, monitoring it for a bit then will repool services
[13:53:48] XioNoX: thanks
[13:53:56] I can do lvs if you want
[13:54:43] fabfur: sure, thx, in like 5min
[13:57:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:57:53] fabfur: actually wait a bit more pleae
[13:57:54] please
[13:57:57] np
[13:58:04] Traffic, ServiceOps new, Machine-Learning-Team (Q4 FY2025-26): k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC) - https://phabricator.wikimedia.org/T424049#11955333 (BWojtowicz-WMF) We confirmed that gRPC endpoints works via standar...
[14:02:44] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:07:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:07:48] fabfur: you can repool the services
[14:08:03] ack thanks
[14:08:32] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955386 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by ayounsi@cumin1003 pool for host wikikube-ctrl2003.codfw.wmn...
[14:13:23] ok lvs2011 is taking connections back
[14:14:02] XioNoX: I'll repool cp2043
[14:14:13] thx
[14:15:00] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955409 (ops-monitoring-bot) Starting pool of db2223 by ayounsi@cumin1003: switch maintenance
[14:17:32] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283 (Papaul) NEW
[14:18:19] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO: Decommision old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427246#11955437 (Papaul)
[14:22:40] FIRING: [13x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:26:57] Traffic: Reboot lvs1019 for memory self-healing - https://phabricator.wikimedia.org/T426109#11955484 (ssingh) Once we reboot for T426585, we can consider this resolved as well.
[14:27:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:32:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:42:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:47:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:49:58] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955618 (ops-monitoring-bot) Starting pool of db2221 by fceratto@cumin1003: Rack maintenance completed
[14:51:18] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955626 (ops-monitoring-bot) Starting pool of db2222 by fceratto@cumin1003: Rack maintenance completed
[14:52:40] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[14:57:07] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955679 (ops-monitoring-bot) Starting pool of db2196 by fceratto@cumin1003: Rack maintenance completed
[15:00:25] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955702 (ops-monitoring-bot) Completed pooling of db2223 by ayounsi@cumin1003: switch maintenance
[15:05:06] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955724 (ops-monitoring-bot) Completed pooling of db2221 by fceratto@cumin1003: Rack maintenance completed
[15:05:17] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#11955725 (Clement_Goubert) After some testing, both the `rest-gateway` and ATS stream the response correctly. The issue is in the upper-l...
[15:06:33] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955732 (ops-monitoring-bot) Completed pooling of db2222 by fceratto@cumin1003: Rack maintenance completed
[15:12:15] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955744 (ops-monitoring-bot) Completed pooling of db2196 by fceratto@cumin1003: Rack maintenance completed
[15:12:45] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955745 (FCeratto-WMF) db2196, db2221 and db2222 have silences removed and are fully pooled-in
[15:41:39] netops, Infrastructure-Foundations, ServiceOps new, ServiceOps-Upgrades-Hardware: codfw: rack A2 maintenance - https://phabricator.wikimedia.org/T426199#11955910 (ayounsi) Open→Resolved Switch upgraded ! Thanks all for the help, next one is going to be easier :)
[16:40:04] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 (ayounsi) NEW p:Triage→Medium
[16:41:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2044&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:41:43] FIRING: HaproxyKafkaRestarted: HaproxyKafka restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaRestarted - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2044 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaRestarted
[16:43:29] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11956307 (ayounsi)
[16:44:07] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11956308 (ayounsi)
[16:44:24] netops, Infrastructure-Foundations, ServiceOps new: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301#11956311 (ayounsi)
[16:44:26] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11956312 (ayounsi)
[16:44:58] brett: did we reboot cp2044?
[16:45:18] yes
[16:45:45] ah ok this is the one due to the TLS keys
[16:46:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2044&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:46:43] RESOLVED: HaproxyKafkaRestarted: HaproxyKafka restarted on cp2044:9100 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaRestarted - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=codfw&var-instance=cp2044 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaRestarted
[17:03:55] Traffic, Data-Engineering, Patch-For-Review: Add X-Provenance data to webrequest_sampled_live - https://phabricator.wikimedia.org/T427068#11956411 (CDanis) Thank you Luca! We have a set of well-known keys for this field, and we'd like to expose that all to clients meaningfully, so I think that we sh...
[18:07:23] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11956719 (wiki_willy) In regards to buying a new CPU - we don't have any more budget available for FY25-26, but I'm ok with going over budget if this is the best route forward. We'll have four addi...
[19:04:09] sukhe: Was ^this your requestctl stuff mentioned in frontline-defenses??
[19:04:26] things look okay now but just checking
[19:04:54] ugh, I meant to post on -operations, there's VCL reloading fires
[19:04:56] brett: the restart one? no, this was the TLS keys not loading yet one
[19:05:06] oh yeah, the VCL reload one
[19:05:11] that must be because of the rule update
[19:05:22] I will force a NOOP run to clear that
[19:08:02] Traffic: Investigate setting init_on_alloc=0 on cache hosts - https://phabricator.wikimedia.org/T401025#11956891 (BCornwall) Fine with that, though it's not a huge priority!
[22:20:49] Traffic: Reboot lvs1019 for memory self-healing - https://phabricator.wikimedia.org/T426109#11957496 (BCornwall) @ssingh The Dell docs mention updating the BIOS: > update BIOS to the latest revision that includes many memory Self-healing capabilities and ongoing enhancements However, that would put our ver...
[22:37:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp5026:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5026&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[23:02:29] RESOLVED: HAProxyRestarted: HAProxy server restarted on cp5026:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus/ops&var-instance=cp5026&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[23:05:25] ^stuck lua
[23:08:26] * swfrench-wmf nods
[23:10:38] it very well may just shift the problem to another server but I depooled it