[00:01:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [00:05:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:10:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86564 and previous config saved to /var/cache/conftool/dbconfig/20251214-001017-marostegui.json [00:10:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:10:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:17:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [00:18:21] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - 
https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:23:36] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:24:27] !incidents [00:24:27] 7180 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [00:24:27] 7181 (RESOLVED) TransitPeeringTransportOutSaturation network sre (gnmi) [00:24:28] 7178 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [00:24:28] 7177 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [00:24:28] 7176 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:24:28] 7173 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [00:24:28] 7175 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:24:29] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:24:29] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [00:24:30] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [00:25:26] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P86565 and previous config saved to /var/cache/conftool/dbconfig/20251214-002526-marostegui.json [00:27:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [00:30:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:37:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [00:39:53] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217840 [00:39:53] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217840 (owner: TrainBranchBot) [00:40:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P86566 and previous config saved to /var/cache/conftool/dbconfig/20251214-004034-marostegui.json [00:40:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: Lumen (442550281) {#3867}) #page - https://w.wiki/Gbyf - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [00:42:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [00:44:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217840 (owner: TrainBranchBot) [00:47:47] !incidents [00:47:47] 7183 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [00:47:48] 7182 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [00:47:48] 7180 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [00:47:48] 7181 (RESOLVED) TransitPeeringTransportOutSaturation network sre (gnmi) [00:47:48] 7178 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [00:47:48] 7177 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [00:47:49] 7176 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:47:49] 7173 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [00:47:49] 7175 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:47:50] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre 
(cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [00:47:50] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [00:47:51] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [00:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:55:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86567 and previous config saved to /var/cache/conftool/dbconfig/20251214-005542-marostegui.json [00:55:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:55:49] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:55:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [00:56:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86568 and previous config saved to /var/cache/conftool/dbconfig/20251214-005607-marostegui.json [01:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:09:54] (PS1) TrainBranchBot: Branch commit 
for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217843 [01:09:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217843 (owner: TrainBranchBot) [01:13:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P86569 and previous config saved to /var/cache/conftool/dbconfig/20251214-011331-ladsgroup.json [01:13:35] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:15:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217843 (owner: TrainBranchBot) [01:20:26] (PS1) Pppery: WIP: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) [01:23:23] (CR) Pppery: "This is not ready to be merged set (and won't be until the next time WMF pulls upstream Phorge, since I want to sort out some upstream stu" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: Pppery) [01:26:58] (PS2) Pppery: WIP: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) [01:28:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P86570 and previous config saved to /var/cache/conftool/dbconfig/20251214-012839-ladsgroup.json [01:30:28] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 29m 42s) [01:43:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to 
https://phabricator.wikimedia.org/P86571 and previous config saved to /var/cache/conftool/dbconfig/20251214-014348-ladsgroup.json [01:58:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P86572 and previous config saved to /var/cache/conftool/dbconfig/20251214-015856-ladsgroup.json [01:59:01] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:59:12] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [01:59:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P86573 and previous config saved to /var/cache/conftool/dbconfig/20251214-015920-ladsgroup.json [02:30:18] (PS1) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [02:30:41] (PS2) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [02:32:29] (CR) Pppery: Handle languages with nonstandard plural rules (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) (owner: Pppery) [02:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:42:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86574 and previous config saved to /var/cache/conftool/dbconfig/20251214-054202-marostegui.json [05:42:08] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:42:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:57:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P86575 and previous config saved to /var/cache/conftool/dbconfig/20251214-055711-marostegui.json [06:12:20] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P86576 and previous config saved to /var/cache/conftool/dbconfig/20251214-061219-marostegui.json [06:27:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86577 and previous config saved to /var/cache/conftool/dbconfig/20251214-062727-marostegui.json [06:27:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:27:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:27:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2226.codfw.wmnet with reason: Maintenance [06:27:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86578 and previous config saved to /var/cache/conftool/dbconfig/20251214-062752-marostegui.json [06:54:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86579 and previous config saved to /var/cache/conftool/dbconfig/20251214-065432-marostegui.json [06:54:38] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:54:38] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:09:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P86580 and 
previous config saved to /var/cache/conftool/dbconfig/20251214-070940-marostegui.json [07:24:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P86581 and previous config saved to /var/cache/conftool/dbconfig/20251214-072449-marostegui.json [07:39:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86582 and previous config saved to /var/cache/conftool/dbconfig/20251214-073957-marostegui.json [07:40:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:40:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:40:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [07:45:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86583 and previous config saved to /var/cache/conftool/dbconfig/20251214-074526-marostegui.json [07:45:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:45:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251214T0800) [08:00:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P86584 and previous config saved to /var/cache/conftool/dbconfig/20251214-080034-marostegui.json [08:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:15:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P86585 and previous config saved to /var/cache/conftool/dbconfig/20251214-081543-marostegui.json [08:30:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86586 and previous config saved to /var/cache/conftool/dbconfig/20251214-083051-marostegui.json [08:30:57] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:30:57] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:31:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2238.codfw.wmnet with reason: Maintenance [08:31:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86587 and previous config saved to /var/cache/conftool/dbconfig/20251214-083116-marostegui.json [08:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:53:32] FIRING: [2x] ProbeDown: Service 
wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:07:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [11:08:55] !incidents [11:08:56] 7188 (UNACKED) Primary inbound port utilisation over 90% (paged) network noc (cr1-esams.wikimedia.org) [11:08:56] 7183 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [11:08:56] 7182 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [11:08:56] 7180 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-eqiad.wikimedia.org) [11:08:56] 7181 (RESOLVED) TransitPeeringTransportOutSaturation network sre (gnmi) [11:08:57] 7178 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [11:09:01] !ack 7188 [11:09:01] 7188 (ACKED) Primary inbound port utilisation over 90% (paged) network noc (cr1-esams.wikimedia.org) 
[11:12:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [12:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1254.eqiad.wmnet with reason: Maintenance [13:28:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86588 and previous config saved to /var/cache/conftool/dbconfig/20251214-132817-marostegui.json [13:28:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:28:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:35:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86589 and previous config saved to /var/cache/conftool/dbconfig/20251214-141235-marostegui.json [14:12:41] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:12:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:27:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P86590 and previous config saved to /var/cache/conftool/dbconfig/20251214-142743-marostegui.json [14:42:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P86591 and previous config saved to /var/cache/conftool/dbconfig/20251214-144251-marostegui.json [14:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:58:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86592 and previous config saved to /var/cache/conftool/dbconfig/20251214-145800-marostegui.json [14:58:05] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:58:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:01:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:06:44] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 
95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:26:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86593 and previous config saved to /var/cache/conftool/dbconfig/20251214-192623-marostegui.json [19:26:29] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:26:30] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:31:44] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:36:44] RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:39:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 70286392 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:40:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3414328 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:41:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P86594 and previous 
config saved to /var/cache/conftool/dbconfig/20251214-194132-marostegui.json [19:56:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P86595 and previous config saved to /var/cache/conftool/dbconfig/20251214-195640-marostegui.json [20:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:11:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86596 and previous config saved to /var/cache/conftool/dbconfig/20251214-201148-marostegui.json [20:11:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:11:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:12:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1259.eqiad.wmnet with reason: Maintenance [20:12:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86597 and previous config saved to /var/cache/conftool/dbconfig/20251214-201213-marostegui.json [20:28:10] (PS3) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [20:37:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P86598 and previous config saved to /var/cache/conftool/dbconfig/20251214-203700-ladsgroup.json [20:37:05] T410589: Optimize all core tables, late 2025 - 
https://phabricator.wikimedia.org/T410589 [20:52:03] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:52:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86599 and previous config saved to /var/cache/conftool/dbconfig/20251214-205208-ladsgroup.json [20:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:57:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:07:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86600 and previous config saved to /var/cache/conftool/dbconfig/20251214-210717-ladsgroup.json [21:22:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T410589)', diff saved to https://phabricator.wikimedia.org/P86601 and previous config saved to /var/cache/conftool/dbconfig/20251214-212226-ladsgroup.json [21:22:31] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:22:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [21:22:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P86602 and 
previous config saved to /var/cache/conftool/dbconfig/20251214-212240-ladsgroup.json [21:47:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:10:35] (PS3) Pppery: WIP: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) [22:44:36] (PS1) Pppery: WIP escape $ symbol as $$ [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217868 [22:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:05:23] (PS1) Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217870 [23:08:57] (PS2) Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217870 [23:12:25] (PS1) Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217871 [23:15:21] (Abandoned) Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217870 (owner: Pppery) [23:16:59] (PS1) Pppery: Rm "translations" into English [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217872 [23:23:46] (CR) Pppery: "FYI I'm aware I've been giving you an enormous amount of code to review in this normally quiet repository with few other code reviewers in" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421) (owner: Pppery) [23:41:46] (PS1) Pppery:
Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217873 [23:42:10] (PS2) Pppery: Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217873