[00:00:05] (03CR) 10Dzahn: "open review season: who thinks this looks sane? I have done the same between releases servers - now this is between deployment server and " [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [00:00:28] (03CR) 10Dzahn: [V:03+1 C:03+1] "the PCC link above shows best what this actually does and does not do" [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [00:04:21] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11453380 (10Dzahn) There are several different "rsyncs" involved here. This is now resolved, using stunnel, for those data transfers from one relea... [00:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:15:45] (03CR) 10Bking: [C:03+2] opensearch-on-k8s: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217606 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [00:37:23] (03PS1) 10RLazarus: mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) [00:37:26] (03PS1) 10RLazarus: mw-*: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217610 (https://phabricator.wikimedia.org/T410975) [00:37:33] (03PS1) 10RLazarus: {api,rest}-gateway: Update to Envoy 1.35.7 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217611 (https://phabricator.wikimedia.org/T410975) [00:40:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1217612 [00:40:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1217612 (owner: 10TrainBranchBot) [00:52:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1217612 (owner: 10TrainBranchBot) [01:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217615 [01:10:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217615 (owner: 10TrainBranchBot) [01:30:24] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 29m 34s) [01:34:00] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217615 (owner: 10TrainBranchBot) [02:00:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... 
[02:00:51] NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [02:01:08] 👋 [02:01:12] o/ [02:02:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [02:05:46] !incidents [02:05:46] 7146 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [02:05:47] 7147 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [02:05:47] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [02:05:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... 
[02:05:51] NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [02:07:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [02:41:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [02:46:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [02:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:55:21] !incidents [02:55:21] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [02:55:21] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [02:55:21] 7146 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [02:55:22] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) 
[03:10:01] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:31:09] i'm going to let this ^ settle overnight. we're okay right now. this is https://phabricator.wikimedia.org/T412467#11453517 [04:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:06:30] (03PS1) 10KartikMistry: Update cxserver to 2025-12-11-184429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217618 (https://phabricator.wikimedia.org/T405004) [06:15:58] (03PS2) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [06:37:52] (03PS3) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [06:40:14] (03CR) 10Ayounsi: [C:03+2] MR: rollback gNMI [homer/public] - 10https://gerrit.wikimedia.org/r/1133398 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [06:41:52] (03Merged) 10jenkins-bot: MR: rollback gNMI 
[homer/public] - 10https://gerrit.wikimedia.org/r/1133398 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [06:45:50] (03PS4) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [06:51:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251212T0700) [07:04:55] 10ops-eqiad, 06DBA, 06DC-Ops: db2166 from s8 started lagging, disk latency up, hw issue? - https://phabricator.wikimedia.org/T411085#11453873 (10Marostegui) 05Resolved→03Open This disk is the one that now shows media errors ` rive /c0/e64/s4 State : ====================== Shield Counter = 0 Media Error... [07:06:37] 10ops-eqiad, 06DBA, 06DC-Ops: db2166 from s8 started lagging, disk latency up, hw issue? - https://phabricator.wikimedia.org/T411085#11453881 (10Marostegui) @Jhancock.wm can you replace the offline disk on this host? Thanks (I think there will be an automatic Degraded RAID task generated in a few minutes) `... [07:06:51] 10ops-eqiad, 06DBA, 06DC-Ops: db2166 from s8 started lagging, disk latency up, hw issue? 
- https://phabricator.wikimedia.org/T411085#11453882 (10Marostegui) a:05FCeratto-WMF→03Jhancock.wm [07:10:01] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:14:48] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:48] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:48] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:48] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:50] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:50] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:50] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:58] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:14:59] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:15:08] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:15:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:15:16] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:48] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:49] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:49] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:50] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:50] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.181 second response time 
https://wikitech.wikimedia.org/wiki/Swift [07:15:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [07:15:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:15:52] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.411 second response time https://wikitech.wikimedia.org/wiki/Swift [07:15:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:16:00] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:16:06] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:16:10] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:16:35] (03CR) 10Marostegui: [C:03+1] "If this worked fine on your test - go for it" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [07:16:50] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: 
HTTP/1.1 200 OK - 297 bytes in 2.152 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:50] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:50] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.401 second response time https://wikitech.wikimedia.org/wiki/Swift [07:16:54] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.233 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:08] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.832 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:17:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:17:48] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:48] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:50] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:17:50] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 
294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:48] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:48] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:48] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [07:18:48] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/Swift [07:19:10] PROBLEM - Ensure acme-chief-api is running on acmechief1002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [07:20:10] RECOVERY - Ensure acme-chief-api is running on acmechief1002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [07:20:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:21:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [07:22:13] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:22:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:23:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [07:26:10] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2166 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Offln : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:26:12] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2166 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Offln : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T412497 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:26:22] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497 (10ops-monitoring-bot) 03NEW [07:29:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db2166 from s8 started lagging, disk latency up, hw issue? 
- https://phabricator.wikimedia.org/T411085#11453898 (10Marostegui) [07:29:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11453899 (10Marostegui) [07:30:13] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11453902 (10Marostegui) p:05Triage→03Medium See T411085 [07:31:01] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11453907 (10Marostegui) Can we get a disk for this host? An used one is fine as this host is out of warranty [07:36:04] !incidents [07:36:04] 7150 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:36:04] 7154 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:36:04] 7155 (UNACKED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [07:36:05] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [07:36:05] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [07:36:05] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [07:36:05] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [07:36:05] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [07:36:06] 7146 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [07:36:06] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:36:17] !ack 7150 [07:36:17] 7150 (ACKED) 
TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:36:23] !ack 7154 [07:36:23] 7154 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:36:28] !ack 7155 [07:36:28] 7155 (ACKED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [07:39:14] (03CR) 10Ryan Kemper: [C:03+2] sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [07:41:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [07:43:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [07:44:46] (03Merged) 10jenkins-bot: sre.data engineering cookbooks: use get_subset [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [07:45:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[07:45:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:51:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251212T0800) [08:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:21:01] (03PS1) 10Brouberol: opensearch-ipoid: increase the cpu and memory of each master in the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217630 (https://phabricator.wikimedia.org/T408238) [08:30:54] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 12709 [08:31:37] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Accept both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [08:32:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12709 [08:36:12] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 14537 [08:39:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 14537 [08:50:01] (03CR) 10Scott French: [C:03+1] 
service: add exclude_from_switchover field. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [08:51:16] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [08:54:14] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 8560 [08:55:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8560 [09:02:08] (03PS2) 10KartikMistry: Update cxserver to 2025-12-11-184429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217618 (https://phabricator.wikimedia.org/T405004) [09:05:41] (03CR) 10Brouberol: [C:03+2] opensearch-ipoid: increase the cpu and memory of each master in the cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217630 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [09:08:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [09:08:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [09:09:11] !log gehel@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [09:09:53] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-12-11-184429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217618 (https://phabricator.wikimedia.org/T405004) (owner: 10KartikMistry) [09:10:55] (03Abandoned) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216190 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [09:11:37] (03Merged) 10jenkins-bot: Update cxserver to 2025-12-11-184429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217618 (https://phabricator.wikimedia.org/T405004) (owner: 
10KartikMistry) [09:14:35] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [09:19:17] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:19:41] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:20:37] (03PS2) 10AOkoth: collab: add vrts junk queue alert [alerts] - 10https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) [09:25:07] (03CR) 10Vgutierrez: ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [09:31:36] (03CR) 10Blake: [C:03+2] service: add exclude_from_switchover field. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [09:32:01] (03CR) 10Vgutierrez: ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [09:35:50] (03PS5) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [09:51:24] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [09:51:28] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [09:51:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: ... 
[09:51:51] NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [09:53:29] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217497 (owner: 10L10n-bot) [09:56:28] !incidents [09:56:28] 7156 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [09:56:28] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [09:56:28] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [09:56:28] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [09:56:29] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [09:56:29] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [09:56:29] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [09:56:29] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [09:56:29] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [09:56:30] 7146 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [09:56:30] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} 
xe-1/0/1:2 gnmi codfw) [09:56:36] !ack 7156 [09:56:36] 7156 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [09:56:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [09:56:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [09:56:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/1/6 (Transit: NTT (234630) {#3475}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:01:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [10:01:46] FIRING: [3x] Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [10:01:51] FIRING: [3x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - 
https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:03:58] (03PS1) 10Ayounsi: Link saturation alerts: shorten IRC message [alerts] - 10https://gerrit.wikimedia.org/r/1217707 [10:07:40] !incidents [10:07:40] 7156 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [10:07:40] 7158 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:07:41] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [10:07:41] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:07:41] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [10:07:41] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:07:41] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [10:07:42] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [10:07:42] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:07:43] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:07:43] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:07:44] 7146 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [10:07:44] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:07:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device 
cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [10:07:49] !ack 7158 [10:07:49] 7158 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:10:26] (03CR) 10Jcrespo: [C:03+1] Link saturation alerts: shorten IRC message [alerts] - 10https://gerrit.wikimedia.org/r/1217707 (owner: 10Ayounsi) [10:10:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [10:11:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [10:16:48] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [10:16:51] FIRING: [4x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:16:51] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [10:17:02] !incidents [10:17:02] 7156 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [10:17:02] 7158 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:17:03] 7159 (UNACKED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [10:17:03] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [10:17:03] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms 
EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:17:03] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [10:17:03] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:17:04] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [10:17:04] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [10:17:05] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:17:05] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:17:06] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:17:06] 7146 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [10:17:07] 7144 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:17:11] !ack 7159 [10:17:11] 7159 (ACKED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [10:21:46] FIRING: [3x] Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [10:21:51] FIRING: [4x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:22:46] RESOLVED: Primary inbound port 
utilisation over 90% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [10:26:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [10:26:51] RESOLVED: [3x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:32:22] (03PS1) 10Blake: Revert "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1217717 [10:34:45] (03CR) 10CI reject: [V:04-1] Revert "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1217717 (owner: 10Blake) [10:35:44] (03PS2) 10Blake: Revert "service: add exclude_from_switchover field." 
[puppet] - 10https://gerrit.wikimedia.org/r/1217717 [10:37:06] FIRING: [3x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:38:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:39:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T410589)', diff saved to https://phabricator.wikimedia.org/P86511 and previous config saved to /var/cache/conftool/dbconfig/20251212-103907-ladsgroup.json [10:39:11] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [10:40:08] (03CR) 10Scott French: [C:03+1] "Wow, that's extremely brittle! When you re-spin this, maybe extend the doc comment to note the "happens before" requirement." [puppet] - 10https://gerrit.wikimedia.org/r/1217717 (owner: 10Blake) [10:41:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [10:42:22] (03PS3) 10Blake: Revert "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1217717 [10:43:07] (03CR) 10Cathal Mooney: [C:03+1] "Thanks! 
We'll get it re-submitted with the spicerack change next week" [puppet] - 10https://gerrit.wikimedia.org/r/1217717 (owner: 10Blake) [10:43:57] (03CR) 10Cathal Mooney: [C:03+1] Link saturation alerts: shorten IRC message [alerts] - 10https://gerrit.wikimedia.org/r/1217707 (owner: 10Ayounsi) [10:45:21] (03PS3) 10Federico Ceratto: sre.mysql.pool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [10:45:34] (03CR) 10Blake: [C:03+2] Revert "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1217717 (owner: 10Blake) [10:47:06] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [10:47:06] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [10:47:50] !log gehel@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS trixie [10:48:11] !incidents [10:48:11] 7161 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:48:12] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [10:48:12] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [10:48:12] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:48:12] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [10:48:12] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc 
(cr4-ulsfo.wikimedia.org) [10:48:13] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:48:13] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [10:48:13] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:48:14] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [10:48:14] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [10:48:14] 10SRE-Access-Requests: Offboarding for joelyrookewmde - https://phabricator.wikimedia.org/T412508 (10JoelyRooke-WMDE) 03NEW [10:48:15] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:48:15] 7149 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:48:16] 7147 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [10:49:02] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217722 [10:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:57:42] !ack 7161 [10:57:43] 7161 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [10:59:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... 
[10:59:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [11:00:04] !incidents [11:00:06] 7161 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:00:06] 7162 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:00:06] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [11:00:06] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:00:07] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:00:07] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:00:07] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:00:07] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:00:07] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [11:00:08] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:00:08] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [11:00:09] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [11:00:21] !ack 7162 [11:00:22] 7162 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion 
(IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:01:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [11:04:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [11:04:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [11:10:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:32:11] !incidents [11:32:11] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:32:11] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:32:12] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [11:32:12] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:32:12] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:32:12] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:32:12] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:32:13] 7150 (RESOLVED) 
TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:32:13] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [11:32:14] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:32:14] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [11:32:15] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [11:32:15] 7151 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [11:33:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [11:33:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [11:36:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [11:38:47] !incidents [11:38:47] 7163 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:38:48] 7164 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:38:48] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:38:48] 7161 (RESOLVED) Primary 
outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:38:48] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [11:38:48] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:38:49] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:38:49] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:38:49] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:38:50] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:38:50] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [11:38:51] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:38:51] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [11:38:52] 7152 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [11:39:06] !ack 7163 [11:39:06] 7163 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:39:08] !ack 7164 [11:39:08] 7164 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:46:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [11:48:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - 
cr1-codfw:xe-1/0/1:0 (Transit: ... [11:48:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [11:49:01] !incidents [11:49:02] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:49:02] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:49:02] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:49:02] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:49:02] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [11:49:03] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:49:03] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:49:03] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:49:03] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:49:04] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:49:04] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [11:49:05] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc 
(cr1-codfw.wikimedia.org) [11:57:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [11:57:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [11:58:32] !ack [11:58:32] no value provided for parameter incident and no default available [11:58:32] Incident id must be an integer [11:58:38] !ack 7163 [11:58:39] Attempt to ack incident 7163 failed. [11:58:41] !ack 7164 [11:58:41] Attempt to ack incident 7164 failed. [11:58:47] !incidents [11:58:47] 7165 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:58:48] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:58:48] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:58:48] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:58:48] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:58:48] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [11:58:49] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [11:58:49] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:58:49] 7159 (RESOLVED) Primary inbound port utilisation 
over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:58:50] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [11:58:50] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:58:51] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [11:58:51] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [11:58:52] 7153 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [11:58:57] !ack 7165 [12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251212T0800) [12:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251212T1200). [12:02:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... 
[12:02:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [12:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:23:41] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11454363 (10Michael) Let's make this a train blocker for the next train. If there is somehow an issue in our code that makes deployments likely to fail (though I c... [12:23:50] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11454364 (10Michael) [12:26:58] (03PS2) 10Aklapper: offboard-user: Remove some outdated privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044 [12:27:32] (03CR) 10CI reject: [V:04-1] offboard-user: Remove some outdated privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044 (owner: 10Aklapper) [12:29:23] (03PS3) 10Aklapper: offboard-user: Remove some outdated privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044 [12:35:08] 06SRE, 06Infrastructure-Foundations, 10netops: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - https://phabricator.wikimedia.org/T412513 (10cmooney) 03NEW p:05Triage→03High [12:40:29] 06SRE, 06Infrastructure-Foundations, 10netops: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - 
https://phabricator.wikimedia.org/T412513#11454417 (10cmooney) Hmm so this problem is worse than I thought at first. It is not just affecting the gnmic stats, but also the SNMP counters... [12:40:44] (03PS1) 10Tchanders: Add experimental temp account creation rate limits for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) [12:58:29] 06SRE, 06Infrastructure-Foundations, 10netops: No byte counters for interfaces on cr2-codfw PIC 0/0 (MPC10E QSFP28 card) - https://phabricator.wikimedia.org/T412513#11454493 (10cmooney) 05Open→03Resolved a:03cmooney So it seems this is a known problem, we actually hit it before on another card. To... [13:02:16] (03CR) 10Ayounsi: [C:03+2] Link saturation alerts: shorten IRC message [alerts] - 10https://gerrit.wikimedia.org/r/1217707 (owner: 10Ayounsi) [13:03:28] (03Merged) 10jenkins-bot: Link saturation alerts: shorten IRC message [alerts] - 10https://gerrit.wikimedia.org/r/1217707 (owner: 10Ayounsi) [13:10:48] (03CR) 10Kosta Harlan: [C:04-1] Add experimental temp account creation rate limits for enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) (owner: 10Tchanders) [13:19:46] (03PS1) 10Brouberol: idp_test: provision the growthbook service config [puppet] - 10https://gerrit.wikimedia.org/r/1217748 (https://phabricator.wikimedia.org/T411752) [13:20:09] (03CR) 10Santiago Faci: "Also happy to consider this change as the one we are going to merge. 
We can abandon the other and wait until the extension is ready to mer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [13:22:27] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [13:22:43] !log gehel@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [13:24:01] (03PS1) 10Brouberol: growthbook: connect production instance to idp-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217749 (https://phabricator.wikimedia.org/T411752) [13:25:02] (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217749 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:25:44] (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215525 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:26:05] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1217748 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:26:47] (03CR) 10Brouberol: [C:03+2] idp_test: provision the growthbook service config [puppet] - 10https://gerrit.wikimedia.org/r/1217748 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:29:14] (03CR) 10Brouberol: [C:03+2] Growthbook: setup OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215525 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:29:17] (03CR) 10Brouberol: [C:03+2] growthbook: connect production instance to idp-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217749 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:31:13] (03Merged) 10jenkins-bot: Growthbook: setup OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215525 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:31:14] (03Merged) 10jenkins-bot: 
growthbook: connect production instance to idp-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217749 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:32:43] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11454584 (10Jhancock.wm) @Marostegui I can get this one replaced for you today. Be on site soon to take care of it [13:33:13] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11454585 (10Marostegui) That'd be great - thank you! [13:35:27] (03CR) 10Marostegui: "Is this ready to review? I didn't check if the CI -1 is valid or a minor thing." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:36:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [13:36:52] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2166 from s8 started lagging, disk latency up, hw issue? 
- https://phabricator.wikimedia.org/T411085#11454607 (10Jclark-ctr) [13:37:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [13:37:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:38:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:40:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:41:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:41:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [13:41:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:41:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86512 and previous config saved to /var/cache/conftool/dbconfig/20251212-134125-marostegui.json [13:41:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:41:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:41:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:42:12] (03PS1) 10Brouberol: growthbook: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217752 (https://phabricator.wikimedia.org/T411752) [13:42:21] (03CR) 10CI reject: [V:04-1] growthbook: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217752 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:42:41] 
(03PS2) 10Brouberol: growthbook: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217752 (https://phabricator.wikimedia.org/T411752) [13:44:47] (03CR) 10Brouberol: [C:03+2] growthbook: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217752 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:50:15] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1002.eqiad.wmnet with OS trixie [13:50:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1... [13:50:48] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:00:03] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo1002.eqiad.wmnet with reason: host reimage [14:00:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:02:53] 06SRE, 06Infrastructure-Foundations: Offboarding for joelyrookewmde - https://phabricator.wikimedia.org/T412508#11454679 (10jcrespo) [14:03:27] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1003.eqiad.wmnet with OS trixie [14:03:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1... 
[14:04:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo1002.eqiad.wmnet with reason: host reimage [14:05:28] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11454686 (10Jclark-ctr) Replaced optic on lvs1020 [14:07:25] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11454699 (10Jclark-ctr) @RKemper I am usually here most mornings early. what day would work best for you next week to down time is there a chance you could down time it the d... [14:10:23] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11454703 (10Jclark-ctr) 05Open→03Resolved Replaced optic in lvs1020 errors have cleared [14:11:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11454705 (10Jclark-ctr) @BTullis. 
just a reminder this is ready for you then can be closed [14:13:26] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo1003.eqiad.wmnet with reason: host reimage [14:14:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturati [14:15:30] !incidents [14:15:30] 7166 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:15:31] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:31] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:31] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:31] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:31] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:32] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [14:15:32] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [14:15:32] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:33] 7159 (RESOLVED) Primary inbound 
port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:15:33] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:15:34] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:15:34] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [14:15:35] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:40] !ack 7166 [14:15:53] !incidents [14:15:53] 7166 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:15:53] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:54] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:54] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:54] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:15:54] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:54] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [14:15:55] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [14:15:55] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:15:56] 7159 (RESOLVED) Primary inbound port utilisation 
over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:15:56] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:15:57] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:15:57] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [14:15:58] 7154 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:16:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [14:16:57] !incidents [14:16:57] 7166 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:16:58] 7167 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:16:58] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:16:58] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:16:58] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:16:58] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:16:59] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:16:59] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre 
(gnmi codfw) [14:16:59] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [14:17:00] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:17:00] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:17:01] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:17:01] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:17:02] 7155 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [14:17:07] !ack 7167 [14:18:38] 06SRE: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524 (10DPogorzelski-WMF) 03NEW [14:18:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [14:18:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo1003.eqiad.wmnet with reason: host reimage [14:19:36] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:19:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatura [14:21:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [14:22:41] jclark@cumin1003 reimage (PID 4010144) is awaiting input [14:23:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [14:24:06] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11454731 (10Jhancock.wm) @Marostegui drive has been replaced with one from a decommed server. looks good on this end. let me know if it looks good to you [14:24:11] !incidents [14:24:11] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [14:24:11] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:24:11] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:24:12] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:24:12] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [14:24:12] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:24:12] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi 
codfw) [14:24:12] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:24:13] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [14:24:13] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [14:24:14] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [14:24:14] 7159 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:24:15] 7157 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr4-ulsfo.wikimedia.org) [14:24:15] 7150 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [14:29:00] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2166 - https://phabricator.wikimedia.org/T412497#11454735 (10Marostegui) Thanks Jenn, it is rebuilding now [14:34:29] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:37:34] jclark@cumin1003 reimage (PID 4024268) is awaiting input [14:40:02] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11454759 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr cables have been disconnected and deleted from netbox [14:51:38] !log jclark@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:51:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo1003.eqiad.wmnet with OS trixie [14:51:41] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:51:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo1002.eqiad.wmnet with OS trixie [14:51:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1003.... [14:51:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454766 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1002.... 
[14:52:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454768 (10Jclark-ctr) [14:52:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11454769 (10Jclark-ctr) 05Stalled→03Resolved [14:55:02] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:20] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. 
- https://phabricator.wikimedia.org/T412458#11454781 (10Jclark-ctr) [15:00:02] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525 (10Jclark-ctr) 03NEW [15:04:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11454818 (10Jclark-ctr) [15:04:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11454819 (10Jclark-ctr) [15:04:36] (03PS1) 10Btullis: Allow the spark serviceaccount to manage PVCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217766 (https://phabricator.wikimedia.org/T406833) [15:05:02] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:46] (03CR) 10Btullis: [C:03+2] Allow the spark serviceaccount to manage PVCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217766 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:10:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:10:02] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:24] (03Merged) 10jenkins-bot: Allow the spark serviceaccount to manage PVCs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217766 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:15:19] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [15:15:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [15:27:24] (03PS1) 10Isabelle Hurbain-Palatin: Activate post-processing cache on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) [15:28:15] (03CR) 10CI reject: [V:04-1] Activate post-processing cache on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [15:30:21] (03PS2) 10Isabelle Hurbain-Palatin: Activate post-processing cache on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) [15:35:03] RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:32] (03CR) 10C. 
Scott Ananian: [C:03+1] Activate post-processing cache on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [15:40:02] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:03] RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:54] (03PS1) 10Xcollazo: Scale down a bit mw-content-history-reconcile-enrich from 20 to 18 TaskManagers. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217773 (https://phabricator.wikimedia.org/T411803) [15:49:59] (03CR) 10Dr0ptp4kt: [C:03+2] Scale down a bit mw-content-history-reconcile-enrich from 20 to 18 TaskManagers. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217773 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [15:50:06] (03CR) 10JavierMonton: [C:03+2] Scale down a bit mw-content-history-reconcile-enrich from 20 to 18 TaskManagers. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217773 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [15:51:49] (03Merged) 10jenkins-bot: Scale down a bit mw-content-history-reconcile-enrich from 20 to 18 TaskManagers. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1217773 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [15:55:27] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:55:39] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:56:03] (03PS1) 10Gergő Tisza: Remove LoggedOut cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1217774 (https://phabricator.wikimedia.org/T142542) [16:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:14:35] (03PS1) 10DLynch: product_metrics.contributors.experiments stream needs use_edge_uniques [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) [16:16:07] (03CR) 10Sergio Gimeno: "Does this affect the Revise tone experiment setup which is for logged in experiments?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [16:18:05] (03CR) 10DLynch: "If yours is only for logged in, I think it won't make a difference? Based on https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Con" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [16:22:46] (03CR) 10Bearloga: [C:03+1] product_metrics.contributors.experiments stream needs use_edge_uniques [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [16:23:29] (03CR) 10Bearloga: [C:03+1] "Correct, this is just to enable population of subject ID for edge uniques-based experiments. Revise Tone won't be affected." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [16:31:01] (03PS1) 10Jsn.sherman: extension-list: Add PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217786 (https://phabricator.wikimedia.org/T412528) [16:31:03] (03PS1) 10Jsn.sherman: InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) [16:31:05] (03PS1) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) [16:31:08] (03PS1) 10Jsn.sherman: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) [16:31:57] (03CR) 10CI reject: [V:04-1] InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [16:31:58] (03CR) 10CI reject: [V:04-1] InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [16:32:09] (03CR) 10CI reject: [V:04-1] CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [16:33:04] (03PS1) 10Gergő Tisza: Remove LoggedOut cookie logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) [16:33:26] (03PS1) 10DLynch: mobileSectionSwitch: experiment name change [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217791 (https://phabricator.wikimedia.org/T410803) [16:35:58] (03PS2) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy 
PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) [16:35:58] (03PS2) 10Jsn.sherman: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) [16:39:18] (03PS2) 10Jsn.sherman: InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) [16:39:18] (03PS3) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) [16:39:18] (03PS3) 10Jsn.sherman: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) [16:39:43] (03CR) 10Santiago Faci: [C:03+1] mobileSectionSwitch: experiment name change [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217791 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [16:45:57] (03PS1) 10Btullis: Add resources to the spark executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217792 (https://phabricator.wikimedia.org/T406833) [16:46:30] (03CR) 10Sergio Gimeno: "Ack, ty!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [16:47:31] (03PS2) 10Ahonc: Change votewiki language to Ukrainian. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217760 (https://phabricator.wikimedia.org/T412521) [16:49:56] (03CR) 10Btullis: [C:03+2] Add resources to the spark executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217792 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:51:49] (03Merged) 10jenkins-bot: Add resources to the spark executor pod template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217792 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:54:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:54:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:56:11] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2166 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [17:09:47] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:16:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2005.codfw.wmnet with OS bookworm [17:18:12] (03PS4) 10Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393 [17:21:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [17:21:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC morning backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217791 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [17:23:26] (03PS1) 10Kimberly Sarabia: Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) [17:26:50] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2005.codfw.wmnet with reason: host reimage [17:30:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2005.codfw.wmnet with reason: host reimage [17:33:32] (03Abandoned) 10Santiago Faci: Rename `mpic` local service to `test-kitchen` because of the platform renaming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [17:35:42] (03PS3) 10Santiago Faci: Deploy TestKitchen to Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [17:35:55] (03CR) 10Santiago Faci: "The other similar change I mentioned before has been abandoned in favour of this one to update how MediaWiki consumes Test Kitchen API" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [17:48:33] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [17:49:47] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:51:37] jhancock@cumin1003 reimage (PID 4105277) is awaiting input [18:16:49] (03CR) 10Ssingh: ats: gerrit: don't validate TLS 
host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:17:36] (03CR) 10Dzahn: [V:03+1 C:03+1] "turns out stunnel4 package is already installed on deployment servers, so this should simply work, without changing anything to rsync::qui" [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [18:39:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:01:37] (03PS1) 10Dzahn: gerrit: get a TLS cert from internal intermediate CA (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1217808 [19:04:40] (03PS1) 10Dzahn: pki: create intermediate CA for gerrit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1217809 [19:04:52] (03CR) 10Vgutierrez: [C:04-1] ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:12:09] (03CR) 10Dzahn: ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:13:04] (03CR) 10Dzahn: "re: inline comments here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215684/4/hieradata/common/profile/trafficserver/backend.ya" [puppet] - 10https://gerrit.wikimedia.org/r/1217809 (owner: 10Dzahn) [19:13:30] (03CR) 10Dzahn: "https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Adding_a_new_intermediate" 
[puppet] - 10https://gerrit.wikimedia.org/r/1217809 (owner: 10Dzahn) [19:30:04] (03CR) 10Vgutierrez: [C:04-1] ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:32:17] (03CR) 10Dzahn: ats: gerrit: don't validate TLS host for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [19:35:39] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:35:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:35:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2005.codfw.wmnet with OS bookworm [19:36:04] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11455780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dse-k8s-worker2005.codfw.wmnet with OS bookworm completed: - ds... [19:36:34] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11455782 (10Jhancock.wm) 05Open→03Resolved [19:37:04] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11455785 (10Jhancock.wm) a:03Jhancock.wm @BTullis these are completed. 
[19:45:39] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:54:40] (03CR) 10Pppery: [C:04-1] "Forgot `arc liberate`, whoops." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393 (owner: 10Pppery) [20:00:49] (03PS5) 10Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393 [20:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:18:47] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 486780792 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:22:47] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:27:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86514 and previous config saved to /var/cache/conftool/dbconfig/20251212-202720-marostegui.json [20:27:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:27:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:37:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2148 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86515 and previous config saved to /var/cache/conftool/dbconfig/20251212-203715-marostegui.json [20:37:20] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:37:21] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:42:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86516 and previous config saved to /var/cache/conftool/dbconfig/20251212-204228-marostegui.json [20:52:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86517 and previous config saved to /var/cache/conftool/dbconfig/20251212-205223-marostegui.json [20:57:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86518 and previous config saved to /var/cache/conftool/dbconfig/20251212-205737-marostegui.json [21:07:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86519 and previous config saved to /var/cache/conftool/dbconfig/20251212-210731-marostegui.json [21:08:07] (03PS1) 10Dzahn: ncredir: add redirects for www.wikipedia25.(org|com) [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) [21:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:12:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db1156 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86520 and previous config saved to /var/cache/conftool/dbconfig/20251212-211245-marostegui.json [21:12:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:12:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:13:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:13:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86521 and previous config saved to /var/cache/conftool/dbconfig/20251212-211309-marostegui.json [21:19:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:20:27] (03PS8) 10Dzahn: unlink wikipedia25.[org|com] from ncredir, point to k8s-ingress [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) [21:22:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86522 and previous config saved to /var/cache/conftool/dbconfig/20251212-212240-marostegui.json [21:22:46] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:22:46] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:22:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance 
[21:23:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86523 and previous config saved to /var/cache/conftool/dbconfig/20251212-212305-marostegui.json [21:25:30] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11456122 (10Dzahn) Thanks for the clarification @ATitkov I uploaded new versions of patches trying to make it possible to fulfill these requ... [21:32:12] !log removing 4 files for legal compliance [21:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:55:37] (03CR) 10Pppery: [C:03+1] ncredir: add redirects for www.wikipedia25.(org|com) [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [21:56:24] (03CR) 10Pppery: [C:03+1] "Probably for all ncredir domains `www.foo.com` should redirect to the same place as `foo.com`. Could that be done for everything instead o" [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [21:58:49] (03CR) 10Pppery: [C:03+1] "In fact, looking at the standard entry 2 below, going to https://www.wikithinkersinc.com redirects to the paid editing blog post. Why isn'" [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [22:01:43] (03CR) 10Dzahn: "Because it's not using a wildcard, *, in the rule." 
[puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [22:03:07] (03PS1) 10Dzahn: wikistats: add an update job for new "gp" table [puppet] - 10https://gerrit.wikimedia.org/r/1217812 (https://phabricator.wikimedia.org/T409014) [22:03:39] (03CR) 10Dzahn: [C:03+2] "cloud VPS - only" [puppet] - 10https://gerrit.wikimedia.org/r/1217812 (https://phabricator.wikimedia.org/T409014) (owner: 10Dzahn) [22:10:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatu [22:11:02] !incidents [22:11:03] 7169 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [22:11:03] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [22:11:03] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [22:11:03] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [22:11:03] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [22:11:04] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [22:11:04] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) 
[22:11:04] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [22:11:04] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [22:11:05] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [22:11:05] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [22:11:06] 7158 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [22:11:11] o/ [22:11:14] o/ [22:15:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [22:16:45] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [22:20:51] RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [22:21:45] FIRING: [2x] Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [22:26:45] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [22:36:05] (03CR) 10Pppery: [C:03+1] "Oh, right, that makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [22:39:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86524 and previous config saved to /var/cache/conftool/dbconfig/20251212-224903-marostegui.json [22:49:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:49:10] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:04:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86525 and previous config saved to /var/cache/conftool/dbconfig/20251212-230412-marostegui.json [23:16:25] !log removing 4 files for legal compliance [23:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86526 and previous config saved to /var/cache/conftool/dbconfig/20251212-231920-marostegui.json [23:22:38] !log removing 1 file for legal compliance [23:22:40] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86527 and previous config saved to /var/cache/conftool/dbconfig/20251212-233428-marostegui.json [23:34:34] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [23:34:35] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [23:34:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [23:34:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86528 and previous config saved to /var/cache/conftool/dbconfig/20251212-233453-marostegui.json