[00:00:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [00:02:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:03:05] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11456294 (VRiley-WMF) [00:04:04] ops-eqiad, SRE, DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11456304 (VRiley-WMF) PDUs 9-12 should be set up according to the checklist. 13 and 14 do not have power at all. 
[00:07:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:12:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:29:31] (CR) Ori: [C:+1] Remove LoggedOut cookie handling [puppet] - https://gerrit.wikimedia.org/r/1217774 (https://phabricator.wikimedia.org/T142542) (owner: Gergő Tisza) [00:31:29] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [00:35:29] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [00:39:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217816 [00:40:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217816 (owner: TrainBranchBot) [00:42:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:47:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:47:41] PROBLEM - ganeti-noded running on ganeti1037 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [00:48:41] RECOVERY - ganeti-noded running on ganeti1037 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [00:53:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217816 (owner: TrainBranchBot) [00:57:02] FIRING: [5x] 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:00:43] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:07:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:10:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217817 [01:10:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217817 (owner: TrainBranchBot) [01:18:26] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 17m 42s) [01:18:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:20:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [01:21:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device 
cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [01:22:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:23:52] !incidents [01:23:52] 7171 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [01:23:52] 7172 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [01:23:52] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [01:23:53] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [01:23:53] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [01:23:53] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [01:23:53] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [01:23:53] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [01:23:54] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [01:23:54] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) 
network noc (cr1-codfw.wikimedia.org) [01:23:55] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [01:23:55] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [01:23:56] 7160 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (gnmi codfw) [01:23:56] 7156 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-eqiad:9804 Transit: NTT (234630) {#3475} xe-3/1/6 gnmi eqiad) [01:34:39] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217817 (owner: TrainBranchBot) [01:36:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [01:38:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:50:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is 
about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86529 and previous config saved to /var/cache/conftool/dbconfig/20251213-041633-marostegui.json [04:16:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:16:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:31:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86530 and previous config saved to /var/cache/conftool/dbconfig/20251213-043141-marostegui.json [04:46:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86531 and previous config saved to /var/cache/conftool/dbconfig/20251213-044649-marostegui.json [05:01:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86532 and previous config saved to /var/cache/conftool/dbconfig/20251213-050158-marostegui.json [05:02:04] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:02:04] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:02:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [05:02:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86533 and previous config saved to 
/var/cache/conftool/dbconfig/20251213-050223-marostegui.json [05:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410589)', diff saved to https://phabricator.wikimedia.org/P86534 and previous config saved to /var/cache/conftool/dbconfig/20251213-054433-ladsgroup.json [05:44:38] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:50:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:59:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P86535 and previous config saved to /var/cache/conftool/dbconfig/20251213-055942-ladsgroup.json [06:14:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P86536 and previous config saved to /var/cache/conftool/dbconfig/20251213-061450-ladsgroup.json [06:29:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T410589)', diff saved to https://phabricator.wikimedia.org/P86537 and previous config saved to /var/cache/conftool/dbconfig/20251213-062958-ladsgroup.json [06:30:03] T410589: Optimize all core tables, 
late 2025 - https://phabricator.wikimedia.org/T410589 [06:30:15] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [06:30:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T410589)', diff saved to https://phabricator.wikimedia.org/P86538 and previous config saved to /var/cache/conftool/dbconfig/20251213-063023-ladsgroup.json [06:48:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86539 and previous config saved to /var/cache/conftool/dbconfig/20251213-064856-marostegui.json [06:49:04] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:49:04] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:04:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86540 and previous config saved to /var/cache/conftool/dbconfig/20251213-070405-marostegui.json [07:19:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86541 and previous config saved to /var/cache/conftool/dbconfig/20251213-071913-marostegui.json [07:22:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [07:24:51] FIRING: TransitPeeringTransportOutSaturation: 
Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatu [07:27:15] !incidents [07:27:15] 7173 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [07:27:15] 7174 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:27:15] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:27:15] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:27:16] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:27:16] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:27:16] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [07:27:16] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:27:16] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:27:17] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:27:17] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre 
(cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:27:18] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:27:34] !ack 7174 [07:27:35] 7174 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:27:37] !ack 7173 [07:27:37] 7173 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [07:34:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86542 and previous config saved to /var/cache/conftool/dbconfig/20251213-073421-marostegui.json [07:34:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [07:34:27] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [07:34:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [07:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86543 and previous config saved to /var/cache/conftool/dbconfig/20251213-073445-marostegui.json [07:39:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSa [07:44:00] !incidents [07:44:00] 7173 (ACKED) Primary outbound port utilisation over 90% (paged) network 
noc (cr2-codfw.wikimedia.org) [07:44:00] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:44:00] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:44:01] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:44:01] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:44:01] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:44:01] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [07:44:01] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:44:02] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:44:02] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:44:03] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:44:03] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:44:04] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:44:04] 7161 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:45:51] FIRING: 
TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatu [07:48:11] !incidents [07:48:11] 7173 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [07:48:11] 7175 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:48:12] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:48:12] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:48:12] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:48:12] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:48:12] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [07:48:13] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [07:48:13] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:48:14] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [07:48:14] 7165 
(RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:48:15] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:48:15] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [07:48:16] 7162 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:48:26] !ack 7175 [07:50:06] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSa [07:52:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [08:00:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatu [08:01:10] !incidents [08:01:10] 7176 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 
gnmi codfw) [08:01:10] 7173 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [08:01:10] 7175 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:01:11] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:01:11] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [08:01:11] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:01:11] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:01:11] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:01:12] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [08:01:12] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:01:13] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [08:01:13] 7165 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [08:01:14] 7163 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [08:01:14] 7164 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:01:20] 
!ack 7176 [08:02:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [08:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:13:42] !incidents [08:13:43] 7176 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:13:43] 7177 (UNACKED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [08:13:43] 7173 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [08:13:43] 7175 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:13:43] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:13:44] 7171 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [08:13:44] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:13:44] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:13:44] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:13:45] 
7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [08:13:45] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:13:46] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [08:13:49] !ack 7177 [08:14:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:19:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:58:09] !incidents [08:58:10] 7176 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:58:10] 7177 (ACKED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [08:58:10] 7173 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr2-codfw.wikimedia.org) [08:58:10] 7175 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:58:10] 7174 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:58:11] 7171 (RESOLVED) 
TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [08:58:11] 7172 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:58:11] 7170 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:58:11] 7169 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-codfw:9804 Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1} xe-0/1/1:0 gnmi codfw) [08:58:12] 7168 (RESOLVED) Primary inbound port utilisation over 90% (paged) network noc (cr3-eqsin.wikimedia.org) [08:58:12] 7167 (RESOLVED) Primary outbound port utilisation over 90% (paged) network noc (cr1-codfw.wikimedia.org) [08:58:13] 7166 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [09:00:21] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSa [09:03:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86544 and previous config saved to /var/cache/conftool/dbconfig/20251213-090355-marostegui.json [09:04:01] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:04:02] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:19:04] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86545 and previous config saved to /var/cache/conftool/dbconfig/20251213-091903-marostegui.json [09:34:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86546 and previous config saved to /var/cache/conftool/dbconfig/20251213-093412-marostegui.json [09:49:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86547 and previous config saved to /var/cache/conftool/dbconfig/20251213-094920-marostegui.json [09:49:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:49:27] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:49:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [09:49:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86548 and previous config saved to /var/cache/conftool/dbconfig/20251213-094944-marostegui.json [09:50:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:37:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86549 and previous config saved to /var/cache/conftool/dbconfig/20251213-103704-marostegui.json [10:37:10] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:37:11] T411164: Drop rev_sha1 from revision table in wmf production - 
https://phabricator.wikimedia.org/T411164 [10:52:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86550 and previous config saved to /var/cache/conftool/dbconfig/20251213-105212-marostegui.json [10:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:07:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86551 and previous config saved to /var/cache/conftool/dbconfig/20251213-110720-marostegui.json [11:19:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86552 and previous config saved to /var/cache/conftool/dbconfig/20251213-111900-marostegui.json [11:19:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:19:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:22:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86553 and previous config saved to /var/cache/conftool/dbconfig/20251213-112229-marostegui.json [11:22:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:34:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86554 and previous config saved to /var/cache/conftool/dbconfig/20251213-113408-marostegui.json [11:49:17] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86555 and previous config saved to /var/cache/conftool/dbconfig/20251213-114916-marostegui.json [12:04:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86556 and previous config saved to /var/cache/conftool/dbconfig/20251213-120425-marostegui.json [12:04:30] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:04:31] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:04:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:19:11] (03PS1) 10Cathal Mooney: Rancid: remove old eqiad row c/d Juniper switches [puppet] - 10https://gerrit.wikimedia.org/r/1217822 [12:53:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:48:47] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [13:49:47] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:49:53] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [13:50:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:51:43] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Thanos [13:51:49] PROBLEM - Thanos swift https on thanos-fe1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [13:54:39] RECOVERY - Thanos swift https on thanos-fe1005 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:31:49] PROBLEM - Thanos swift https on thanos-fe1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [14:34:39] RECOVERY - Thanos swift https on thanos-fe1006 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:49:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[14:54:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:51:47] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [17:10:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86557 and previous config saved to /var/cache/conftool/dbconfig/20251213-171057-marostegui.json [17:11:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:11:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:54:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [17:54:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86558 and previous config saved to /var/cache/conftool/dbconfig/20251213-175442-marostegui.json [17:54:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:54:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [18:17:46] (03CR) 10Cathal Mooney: [C:03+2] Rancid: remove old eqiad row c/d Juniper switches [puppet] - 10https://gerrit.wikimedia.org/r/1217822 (owner: 10Cathal Mooney) [18:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:52:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown