[00:05:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:07:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183244
[00:08:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183244 (owner: TrainBranchBot)
[00:12:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P82232 and previous config saved to /var/cache/conftool/dbconfig/20250831-001242-ladsgroup.json
[00:27:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P82233 and previous config saved to /var/cache/conftool/dbconfig/20250831-002750-ladsgroup.json
[00:31:57] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183244 (owner: TrainBranchBot)
[00:42:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T402925)', diff saved to https://phabricator.wikimedia.org/P82234 and previous config saved to /var/cache/conftool/dbconfig/20250831-004257-ladsgroup.json
[00:43:04] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:43:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[00:43:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T402925)', diff saved to https://phabricator.wikimedia.org/P82235 and previous config saved to /var/cache/conftool/dbconfig/20250831-004320-ladsgroup.json
[01:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:05:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:09:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T402925)', diff saved to https://phabricator.wikimedia.org/P82236 and previous config saved to /var/cache/conftool/dbconfig/20250831-010954-ladsgroup.json
[01:10:00] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:21:47] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11134249 (Don-vip) I still have [[ https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/jobs/601049 | an error ]] in my unit tests from Gitlab CI wh...
[01:25:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P82237 and previous config saved to /var/cache/conftool/dbconfig/20250831-012501-ladsgroup.json
[01:34:15] (PS4) Pppery: Varnish: Fix rate limit comment to match code [puppet] - https://gerrit.wikimedia.org/r/1183245 (https://phabricator.wikimedia.org/T400119)
[01:40:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P82238 and previous config saved to /var/cache/conftool/dbconfig/20250831-014009-ladsgroup.json
[01:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:55:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T402925)', diff saved to https://phabricator.wikimedia.org/P82239 and previous config saved to /var/cache/conftool/dbconfig/20250831-015516-ladsgroup.json
[01:55:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:55:32] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[02:23:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[02:23:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T402925)', diff saved to https://phabricator.wikimedia.org/P82240 and previous config saved to /var/cache/conftool/dbconfig/20250831-022308-ladsgroup.json
[02:23:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:27:32] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336 (Novem_Linguae) NEW
[02:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:49:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T402925)', diff saved to https://phabricator.wikimedia.org/P82241 and previous config saved to /var/cache/conftool/dbconfig/20250831-024947-ladsgroup.json
[02:49:53] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:53:17] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:04:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P82242 and previous config saved to /var/cache/conftool/dbconfig/20250831-030455-ladsgroup.json
[03:20:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P82243 and previous config saved to /var/cache/conftool/dbconfig/20250831-032003-ladsgroup.json
[03:29:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:35:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T402925)', diff saved to https://phabricator.wikimedia.org/P82244 and previous config saved to /var/cache/conftool/dbconfig/20250831-033510-ladsgroup.json
[03:35:17] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:35:26] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1259.eqiad.wmnet with reason: Maintenance
[03:35:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T402925)', diff saved to https://phabricator.wikimedia.org/P82245 and previous config saved to /var/cache/conftool/dbconfig/20250831-033533-ladsgroup.json
[03:43:58] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134317 (phaultfinder)
[03:48:53] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134318 (phaultfinder)
[04:05:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T402925)', diff saved to https://phabricator.wikimedia.org/P82246 and previous config saved to /var/cache/conftool/dbconfig/20250831-040509-ladsgroup.json
[04:05:16] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:20:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P82247 and previous config saved to /var/cache/conftool/dbconfig/20250831-042017-ladsgroup.json
[04:35:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P82248 and previous config saved to /var/cache/conftool/dbconfig/20250831-043524-ladsgroup.json
[04:50:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T402925)', diff saved to https://phabricator.wikimedia.org/P82249 and previous config saved to /var/cache/conftool/dbconfig/20250831-045032-ladsgroup.json
[04:50:38] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:50:48] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[05:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:20:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance
[05:20:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T402925)', diff saved to https://phabricator.wikimedia.org/P82250 and previous config saved to /var/cache/conftool/dbconfig/20250831-052048-ladsgroup.json
[05:20:54] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:55:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T402925)', diff saved to https://phabricator.wikimedia.org/P82251 and previous config saved to /var/cache/conftool/dbconfig/20250831-055506-ladsgroup.json
[05:55:17] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:10:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P82252 and previous config saved to /var/cache/conftool/dbconfig/20250831-061013-ladsgroup.json
[06:25:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P82253 and previous config saved to /var/cache/conftool/dbconfig/20250831-062521-ladsgroup.json
[06:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:40:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T402925)', diff saved to https://phabricator.wikimedia.org/P82254 and previous config saved to /var/cache/conftool/dbconfig/20250831-064028-ladsgroup.json
[06:40:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:40:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance
[06:40:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T402925)', diff saved to https://phabricator.wikimedia.org/P82255 and previous config saved to /var/cache/conftool/dbconfig/20250831-064052-ladsgroup.json
[06:44:32] (PS1) Novem Linguae: data.yaml: change wiki replica to mediawiki replica [puppet] - https://gerrit.wikimedia.org/r/1183247
[06:45:59] (CR) CI reject: [V:-1] data.yaml: change wiki replica to mediawiki replica [puppet] - https://gerrit.wikimedia.org/r/1183247 (owner: Novem Linguae)
[06:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250831T0700)
[07:03:58] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:04:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[07:05:01] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:44] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[07:05:55] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:58] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[07:10:44] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[07:11:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T402925)', diff saved to https://phabricator.wikimedia.org/P82256 and previous config saved to /var/cache/conftool/dbconfig/20250831-071102-ladsgroup.json
[07:11:08] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[07:26:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P82257 and previous config saved to /var/cache/conftool/dbconfig/20250831-072610-ladsgroup.json
[07:29:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:29:37] (PS2) Novem Linguae: data.yaml: change wiki replica to mediawiki replica [puppet] - https://gerrit.wikimedia.org/r/1183247
[07:41:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P82258 and previous config saved to /var/cache/conftool/dbconfig/20250831-074117-ladsgroup.json
[07:48:59] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134368 (phaultfinder)
[07:53:54] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134369 (phaultfinder)
[07:56:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T402925)', diff saved to https://phabricator.wikimedia.org/P82259 and previous config saved to /var/cache/conftool/dbconfig/20250831-075624-ladsgroup.json
[07:56:31] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[07:56:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance
[07:56:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T402925)', diff saved to https://phabricator.wikimedia.org/P82260 and previous config saved to /var/cache/conftool/dbconfig/20250831-075648-ladsgroup.json
[08:23:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T402925)', diff saved to https://phabricator.wikimedia.org/P82261 and previous config saved to /var/cache/conftool/dbconfig/20250831-082301-ladsgroup.json
[08:23:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[08:38:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P82262 and previous config saved to /var/cache/conftool/dbconfig/20250831-083808-ladsgroup.json
[08:53:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P82263 and previous config saved to /var/cache/conftool/dbconfig/20250831-085316-ladsgroup.json
[09:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T402925)', diff saved to https://phabricator.wikimedia.org/P82264 and previous config saved to /var/cache/conftool/dbconfig/20250831-090824-ladsgroup.json
[09:08:30] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[09:08:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance
[09:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[09:24:40] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and asw1-by27-esams (185.15.59.155) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:24:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:28:40] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:29:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:36:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[09:36:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2204 (T402925)', diff saved to https://phabricator.wikimedia.org/P82265 and previous config saved to /var/cache/conftool/dbconfig/20250831-093628-ladsgroup.json
[09:36:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[09:37:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T402925)', diff saved to https://phabricator.wikimedia.org/P82266 and previous config saved to /var/cache/conftool/dbconfig/20250831-093753-ladsgroup.json
[09:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:07:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:23:29] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2225.codfw.wmnet with reason: Maintenance
[10:23:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T402925)', diff saved to https://phabricator.wikimedia.org/P82267 and previous config saved to /var/cache/conftool/dbconfig/20250831-102336-ladsgroup.json
[10:23:42] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[10:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[10:33:35] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2a00:1188:5:e::4)
[10:33:49] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:36:07] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2607:fb58:9000:7::2)
[10:36:27] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:36:29] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[10:38:37] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 82.31 ms
[10:38:51] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 91.57 ms
[10:41:09] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 64.41 ms
[10:41:29] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.79 ms
[10:41:31] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 66.50 ms
[10:49:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[10:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:54:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[10:59:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:20:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P82270 and previous config saved to /var/cache/conftool/dbconfig/20250831-112035-ladsgroup.json
[11:35:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T402925)', diff saved to https://phabricator.wikimedia.org/P82271 and previous config saved to /var/cache/conftool/dbconfig/20250831-113542-ladsgroup.json
[11:35:49] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[11:35:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2226.codfw.wmnet with reason: Maintenance
[11:36:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T402925)', diff saved to https://phabricator.wikimedia.org/P82272 and previous config saved to /var/cache/conftool/dbconfig/20250831-113606-ladsgroup.json
[11:38:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T402925)', diff saved to https://phabricator.wikimedia.org/P82273 and previous config saved to /var/cache/conftool/dbconfig/20250831-113830-ladsgroup.json
[11:53:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P82274 and previous config saved to /var/cache/conftool/dbconfig/20250831-115338-ladsgroup.json
[11:53:52] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134461 (phaultfinder)
[11:58:54] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134465 (phaultfinder)
[12:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P82275 and previous config saved to /var/cache/conftool/dbconfig/20250831-120846-ladsgroup.json
[12:23:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T402925)', diff saved to https://phabricator.wikimedia.org/P82276 and previous config saved to /var/cache/conftool/dbconfig/20250831-122353-ladsgroup.json
[12:24:00] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[12:24:09] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2238.codfw.wmnet with reason: Maintenance
[12:24:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82277 and previous config saved to /var/cache/conftool/dbconfig/20250831-122416-ladsgroup.json
[12:51:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82278 and previous config saved to /var/cache/conftool/dbconfig/20250831-125150-ladsgroup.json
[12:51:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[12:58:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P82279 and previous config saved to /var/cache/conftool/dbconfig/20250831-130657-ladsgroup.json
[13:07:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[13:22:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P82280 and previous config saved to /var/cache/conftool/dbconfig/20250831-132205-ladsgroup.json
[13:25:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:28:40] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:29:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:37:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T402925)', diff saved to https://phabricator.wikimedia.org/P82281 and previous config saved to /var/cache/conftool/dbconfig/20250831-133713-ladsgroup.json
[13:37:19] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[13:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:02:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:36] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:51:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:58:53] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134579 (phaultfinder) [16:02:23] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:02:56] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:57] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134580 (phaultfinder) [17:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm -
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:25:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:29:36] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:29:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [18:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:36] FIRING: 
OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:20:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:34] (PS2) D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) [19:21:38] (CR) D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01) [19:25:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:21] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Tue 16 Sep 2025 07:40:00 PM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:00] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134636 (phaultfinder) [20:08:55] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134637 (phaultfinder) [20:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:26:27] (PS1) Ladsgroup: Drop support for categorylinks read old [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) [20:29:07] (CR) CI reject: [V:-1] Drop support for categorylinks read old [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: Ladsgroup) [20:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [21:25:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core:
asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:29:36] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:29:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:55] (PS16) Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) [22:56:57] (PS2) Krinkle:
varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) [23:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:25:10] (PS1) Krinkle: varnish: remove unused allowed_methods /hieradata/role/common/cache/text.yaml [puppet] - https://gerrit.wikimedia.org/r/1183274 (https://phabricator.wikimedia.org/T392073) [23:25:22] (PS1) Krinkle: beta: Update hieradata for fe_vcl_config from Horizon [puppet] - https://gerrit.wikimedia.org/r/1183275 [23:32:04] (PS2) Krinkle: beta: Update hieradata for fe_vcl_config from Horizon [puppet] - https://gerrit.wikimedia.org/r/1183275 [23:34:57] (CR) Krinkle: "I've applied this to beta, and ran puppet on both varnish hosts, no-op. I then removed the now-redundant hieradata override in Horizon, an" [puppet] - https://gerrit.wikimedia.org/r/1183275 (owner: Krinkle) [23:37:57] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183276 [23:37:57] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183276 (owner: TrainBranchBot) [23:52:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183276 (owner: TrainBranchBot)