[00:02:20] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1225043 [00:12:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:13:26] (CR) Pppery: "No idea what these domains are about." [puppet] - https://gerrit.wikimedia.org/r/1225044 (owner: Ncmonitor) [00:17:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:40:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225049 [00:40:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225049 (owner: TrainBranchBot) [00:45:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87037 and previous config saved to /var/cache/conftool/dbconfig/20260110-004509-marostegui.json [00:45:14] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:45:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:54:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1225049 (owner: TrainBranchBot) [00:55:18] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87038 and previous config saved to /var/cache/conftool/dbconfig/20260110-005517-marostegui.json [00:59:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [00:59:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87039 and previous config saved to /var/cache/conftool/dbconfig/20260110-005937-marostegui.json [00:59:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:59:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [01:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:05:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87040 and previous config saved to /var/cache/conftool/dbconfig/20260110-010525-marostegui.json [01:10:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225053 [01:10:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225053 (owner: TrainBranchBot) [01:13:46] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 00s) [01:15:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87041 and previous config saved to /var/cache/conftool/dbconfig/20260110-011534-marostegui.json [01:15:39] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [01:15:39] T411164: Drop rev_sha1 
from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [01:15:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [01:15:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87042 and previous config saved to /var/cache/conftool/dbconfig/20260110-011558-marostegui.json [01:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:33:07] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1225053 (owner: TrainBranchBot) [01:54:36] !log zabe@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [01:55:14] !log zabe@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [02:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:22:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:37:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:06] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:27:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:47:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:50:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:35:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:44:28] (CR) A smart kitten: Load MultiTitle on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461) (owner: Tbodt) [10:03:44] (CR) A smart kitten: Load MultiTitle on beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461) (owner: Tbodt) [10:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:25:37] RECOVERY - Check unit 
status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:47:59] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87043 and previous config saved to /var/cache/conftool/dbconfig/20260110-110020-marostegui.json [11:00:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:00:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:00:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87044 and previous config saved to /var/cache/conftool/dbconfig/20260110-110039-marostegui.json [11:10:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87045 and previous config saved to /var/cache/conftool/dbconfig/20260110-111028-marostegui.json [11:10:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P87046 and previous config saved to 
/var/cache/conftool/dbconfig/20260110-111047-marostegui.json [11:20:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87047 and previous config saved to /var/cache/conftool/dbconfig/20260110-112037-marostegui.json [11:20:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P87048 and previous config saved to /var/cache/conftool/dbconfig/20260110-112055-marostegui.json [11:30:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87049 and previous config saved to /var/cache/conftool/dbconfig/20260110-113045-marostegui.json [11:30:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:30:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:31:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [11:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87050 and previous config saved to /var/cache/conftool/dbconfig/20260110-113104-marostegui.json [11:31:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87051 and previous config saved to /var/cache/conftool/dbconfig/20260110-113110-marostegui.json [11:31:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [11:31:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T411163 T411164)', diff saved to 
https://phabricator.wikimedia.org/P87052 and previous config saved to /var/cache/conftool/dbconfig/20260110-113128-marostegui.json [11:42:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturati [11:45:11] need little time to get to my laptop [11:46:04] !incidents [11:46:05] 7304 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:52:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatura [12:06:10] !incidents [12:06:10] 7304 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [12:50:45] (PS1) A smart kitten: CommonSettings-labs: Remove redundant code for loading/configuring Phonos [mediawiki-config] - https://gerrit.wikimedia.org/r/1225075 [12:50:45] (CR) A smart kitten: "Adding you as reviewers as some Community Tech folks (apologies that I don’t know exactly which folks work(ed) on Phonos :) )" [mediawiki-config] - https://gerrit.wikimedia.org/r/1225075 (owner: A smart kitten) [12:56:03] PROBLEM - Host durum7004 is DOWN: CRITICAL - 
Time to live exceeded (10.140.2.7) [12:56:29] RECOVERY - Host durum7004 is UP: PING OK - Packet loss = 0%, RTA = 138.25 ms [12:59:25] (CR) A smart kitten: "Side-note: While looking into this, I also noticed that the `wgPhonosInlineAudioPlayerMode` config setting [in `InitialiseSettings-labs.ph" [mediawiki-config] - https://gerrit.wikimedia.org/r/1225075 (owner: A smart kitten) [13:01:01] PROBLEM - Host lvs7001 is DOWN: CRITICAL - Time to live exceeded (10.140.0.13) [13:01:01] PROBLEM - Host lvs7003 is DOWN: CRITICAL - Time to live exceeded (10.140.0.14) [13:01:21] RECOVERY - Host lvs7001 is UP: PING OK - Packet loss = 0%, RTA = 137.54 ms [13:01:23] RECOVERY - Host lvs7003 is UP: PING OK - Packet loss = 0%, RTA = 137.38 ms [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:43:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:43:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... 
[13:43:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:54:56] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11509648 (Xqt) @Fabfur: I still have the 429 problem with my bot. Here the html content: ` WARNING: Http response status 429 WARNING: Non-JSON response received... [14:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:43:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:43:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [14:43:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:47:59] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:52:03] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11509691 (Joe) Open→Resolved a:Fabfur >>! In T414173#11509648, @Xqt wrote: > @Fabfur: I still have the 429 problem with my bot. Here the html con... 
[15:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [16:53:47] (PS1) Zabe: Disable updates for Special:GloballyUnusedFiles [mediawiki-config] - https://gerrit.wikimedia.org/r/1225088 (https://phabricator.wikimedia.org/T414202) [17:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:13:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [18:21:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:41] <_joe_> !incidents [18:22:41] 7305 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) 
[18:22:41] 7304 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [18:23:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [18:23:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:24:10] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:26:06] !incidents [18:26:06] 7305 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [18:26:06] 7306 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:26:06] 7307 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:26:07] 7304 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [18:26:10] !ack [18:26:11] 7276 (RESOLVED) Manual (paged) by mvernon (mvernon@wikimedia.org): Unable to resolve previous incident 7275 before 
shift end [18:26:11] 7276 (RESOLVED) Manual (paged) by mvernon (mvernon@wikimedia.org): Unable to resolve previous incident 7275 before shift end [18:26:11] 7276 (RESOLVED) Manual (paged) by mvernon (mvernon@wikimedia.org): Unable to resolve previous incident 7275 before shift end [18:26:21] !incidents [18:26:21] 7305 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [18:26:22] 7306 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:26:22] 7307 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:26:22] 7304 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [18:26:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturati [18:26:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:27:21] !incidents [18:27:22] 7306 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:27:22] 7307 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:27:22] 7308 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [18:27:22] 7305 (RESOLVED) ProbeDown sre 
(10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [18:27:22] 7304 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [18:27:29] !ack 7308 [18:27:29] 7276 (RESOLVED) Manual (paged) by mvernon (mvernon@wikimedia.org): Unable to resolve previous incident 7275 before shift end [18:28:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [18:28:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [18:29:10] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:31:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSat [18:38:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:41:51] FIRING: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:42:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:46:51] FIRING: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:47:59] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:56:51] RESOLVED: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [19:20:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:20:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T413525)', diff saved to https://phabricator.wikimedia.org/P87053 and previous config saved to /var/cache/conftool/dbconfig/20260110-192058-marostegui.json [19:21:01] T413525: Add il_target_id to 
imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:27:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:27:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T413525)', diff saved to https://phabricator.wikimedia.org/P87054 and previous config saved to /var/cache/conftool/dbconfig/20260110-192742-marostegui.json [19:27:46] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:35:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T413525)', diff saved to https://phabricator.wikimedia.org/P87055 and previous config saved to /var/cache/conftool/dbconfig/20260110-193536-marostegui.json [19:35:40] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:45:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P87056 and previous config saved to /var/cache/conftool/dbconfig/20260110-194544-marostegui.json [19:55:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P87057 and previous config saved to /var/cache/conftool/dbconfig/20260110-195553-marostegui.json [20:06:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T413525)', diff saved to https://phabricator.wikimedia.org/P87058 and previous config saved to /var/cache/conftool/dbconfig/20260110-200601-marostegui.json [20:06:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:06:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [20:06:16] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T413525)', diff saved to https://phabricator.wikimedia.org/P87059 and previous config saved to /var/cache/conftool/dbconfig/20260110-200615-marostegui.json [20:17:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T413525)', diff saved to https://phabricator.wikimedia.org/P87060 and previous config saved to /var/cache/conftool/dbconfig/20260110-201714-marostegui.json [20:17:19] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:21:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T413525)', diff saved to https://phabricator.wikimedia.org/P87061 and previous config saved to /var/cache/conftool/dbconfig/20260110-202138-marostegui.json [20:27:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P87062 and previous config saved to /var/cache/conftool/dbconfig/20260110-202722-marostegui.json [20:29:13] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11509876 (Benwing2) @Fabfur any idea why my pywikibot script is getting 429 errors every 200 pages it's pulling down? It pulls down about 7 pages a second, w... 
[20:31:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:31:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P87063 and previous config saved to /var/cache/conftool/dbconfig/20260110-203146-marostegui.json [20:37:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P87064 and previous config saved to /var/cache/conftool/dbconfig/20260110-203731-marostegui.json [20:41:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P87065 and previous config saved to /var/cache/conftool/dbconfig/20260110-204154-marostegui.json [20:46:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T413525)', diff saved to https://phabricator.wikimedia.org/P87066 and previous config saved to /var/cache/conftool/dbconfig/20260110-204739-marostegui.json [20:47:43] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:47:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [20:48:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T413525)', diff saved to https://phabricator.wikimedia.org/P87067 and previous config saved to /var/cache/conftool/dbconfig/20260110-204804-marostegui.json [20:52:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T413525)', diff 
saved to https://phabricator.wikimedia.org/P87068 and previous config saved to /var/cache/conftool/dbconfig/20260110-205201-marostegui.json [20:52:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [20:52:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T413525)', diff saved to https://phabricator.wikimedia.org/P87069 and previous config saved to /var/cache/conftool/dbconfig/20260110-205226-marostegui.json [20:57:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:02:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:07:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T413525)', diff saved to https://phabricator.wikimedia.org/P87070 and previous config saved to /var/cache/conftool/dbconfig/20260110-210711-marostegui.json [21:07:15] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:12:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:17:20] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P87071 and previous config saved to /var/cache/conftool/dbconfig/20260110-211720-marostegui.json [21:20:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87072 and previous config saved to /var/cache/conftool/dbconfig/20260110-212006-marostegui.json [21:20:12] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:20:12] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:23:11] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:27:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P87073 and previous config saved to /var/cache/conftool/dbconfig/20260110-212728-marostegui.json [21:30:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P87074 and 
previous config saved to /var/cache/conftool/dbconfig/20260110-213015-marostegui.json [21:37:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T413525)', diff saved to https://phabricator.wikimedia.org/P87075 and previous config saved to /var/cache/conftool/dbconfig/20260110-213700-marostegui.json [21:37:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:37:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T413525)', diff saved to https://phabricator.wikimedia.org/P87076 and previous config saved to /var/cache/conftool/dbconfig/20260110-213736-marostegui.json [21:37:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [21:38:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T413525)', diff saved to https://phabricator.wikimedia.org/P87077 and previous config saved to /var/cache/conftool/dbconfig/20260110-213801-marostegui.json [21:40:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P87078 and previous config saved to /var/cache/conftool/dbconfig/20260110-214023-marostegui.json [21:47:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P87079 and previous config saved to /var/cache/conftool/dbconfig/20260110-214708-marostegui.json [21:50:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87080 and previous config saved to /var/cache/conftool/dbconfig/20260110-215037-marostegui.json [21:50:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:50:50] T411164: Drop rev_sha1 from 
revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:50:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [21:51:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87081 and previous config saved to /var/cache/conftool/dbconfig/20260110-215104-marostegui.json [21:51:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87082 and previous config saved to /var/cache/conftool/dbconfig/20260110-215142-marostegui.json [21:53:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T413525)', diff saved to https://phabricator.wikimedia.org/P87083 and previous config saved to /var/cache/conftool/dbconfig/20260110-215305-marostegui.json [21:53:10] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:53:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:56:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P87084 and previous config saved to /var/cache/conftool/dbconfig/20260110-215716-marostegui.json [22:01:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87085 and previous config saved to /var/cache/conftool/dbconfig/20260110-220150-marostegui.json [22:03:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P87086 and previous config saved to /var/cache/conftool/dbconfig/20260110-220314-marostegui.json [22:07:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T413525)', diff saved to https://phabricator.wikimedia.org/P87087 and previous config saved to /var/cache/conftool/dbconfig/20260110-220725-marostegui.json [22:07:29] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:07:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [22:07:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87088 and previous config saved to /var/cache/conftool/dbconfig/20260110-220750-marostegui.json [22:11:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87089 and previous config saved to /var/cache/conftool/dbconfig/20260110-221158-marostegui.json [22:13:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P87090 and previous config saved to 
/var/cache/conftool/dbconfig/20260110-221322-marostegui.json [22:16:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:22:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87091 and previous config saved to /var/cache/conftool/dbconfig/20260110-222207-marostegui.json [22:22:12] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:22:12] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:22:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [22:22:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87092 and previous config saved to /var/cache/conftool/dbconfig/20260110-222231-marostegui.json [22:23:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T413525)', diff saved to https://phabricator.wikimedia.org/P87093 and previous config saved to /var/cache/conftool/dbconfig/20260110-222330-marostegui.json [22:23:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:23:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [22:23:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T413525)', diff saved to 
https://phabricator.wikimedia.org/P87094 and previous config saved to /var/cache/conftool/dbconfig/20260110-222354-marostegui.json [22:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:38:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T413525)', diff saved to https://phabricator.wikimedia.org/P87095 and previous config saved to /var/cache/conftool/dbconfig/20260110-223818-marostegui.json [22:38:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:47:59] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:48:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P87096 and previous config saved to /var/cache/conftool/dbconfig/20260110-224826-marostegui.json [22:58:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87097 and previous config saved to /var/cache/conftool/dbconfig/20260110-225807-marostegui.json [22:58:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [22:58:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P87098 and previous config saved to /var/cache/conftool/dbconfig/20260110-225835-marostegui.json 
[23:00:21] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P87099 and previous config saved to /var/cache/conftool/dbconfig/20260110-230816-marostegui.json [23:08:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T413525)', diff saved to https://phabricator.wikimedia.org/P87100 and previous config saved to /var/cache/conftool/dbconfig/20260110-230843-marostegui.json [23:08:47] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:09:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance [23:09:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [23:09:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T413525)', diff saved to https://phabricator.wikimedia.org/P87101 and previous config saved to /var/cache/conftool/dbconfig/20260110-230930-marostegui.json [23:12:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:12:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P87102 and previous config saved to /var/cache/conftool/dbconfig/20260110-231824-marostegui.json [23:24:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T413525)', diff 
saved to https://phabricator.wikimedia.org/P87103 and previous config saved to /var/cache/conftool/dbconfig/20260110-232441-marostegui.json [23:24:46] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:28:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T413525)', diff saved to https://phabricator.wikimedia.org/P87104 and previous config saved to /var/cache/conftool/dbconfig/20260110-232832-marostegui.json [23:28:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [23:28:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T413525)', diff saved to https://phabricator.wikimedia.org/P87105 and previous config saved to /var/cache/conftool/dbconfig/20260110-232856-marostegui.json [23:34:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P87106 and previous config saved to /var/cache/conftool/dbconfig/20260110-233450-marostegui.json [23:35:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:37:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:37:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:44:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P87107 and previous config saved to /var/cache/conftool/dbconfig/20260110-234458-marostegui.json [23:55:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T413525)', diff saved to https://phabricator.wikimedia.org/P87108 and previous config saved to /var/cache/conftool/dbconfig/20260110-235506-marostegui.json [23:55:10] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:55:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance