[00:02:45] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:04:47] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:09:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074707 (owner: 10TrainBranchBot)
[00:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:05:25] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:53] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:15] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 205487 MB (5% inode=92%): /srv/swift-storage/sdg1 192765 MB (5% inode=91%): /srv/swift-storage/sdc1 188542 MB (4% inode=92%): /srv/swift-storage/sdi1 172745 MB (4% inode=91%): /srv/swift-storage/sde1 192299 MB (5% inode=91%): /srv/swift-storage/sdh1 179129 MB (4% inode=91%): /srv/swift-storage/sdj1 204970 MB (5% inode=92%): /srv/swift-st
[03:03:15] k1 172033 MB (4% inode=91%): /srv/swift-storage/sdd1 152459 MB (3% inode=90%): /srv/swift-storage/sdm1 171015 MB (4% inode=91%): /srv/swift-storage/sdl1 176797 MB (4% inode=91%): /srv/swift-storage/sdn1 183649 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
[03:43:15] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 195286 MB (5% inode=91%): /srv/swift-storage/sde1 190339 MB (4% inode=91%): /srv/swift-storage/sdh1 173430 MB (4% inode=91%): /srv/swift-storage/sdc1 186172 MB (4% inode=91%): /srv/swift-storage/sdd1 178239 MB (4% inode=91%): /srv/swift-storage/sdg1 180244 MB (4% inode=91%): /srv/swift-storage/sdi1 193185 MB (5% inode=91%): /srv/swift-st
[03:43:15] j1 187253 MB (4% inode=91%): /srv/swift-storage/sdk1 151578 MB (3% inode=91%): /srv/swift-storage/sdl1 177268 MB (4% inode=91%): /srv/swift-storage/sdm1 172198 MB (4% inode=92%): /srv/swift-storage/sdn1 170166 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[03:54:55] FIRING: [2x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165904 (10phaultfinder)
[04:40:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165911 (10phaultfinder)
[04:41:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:41:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:03:15] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdg1 186399 MB (4% inode=91%): /srv/swift-storage/sdd1 201879 MB (5% inode=92%): /srv/swift-storage/sdc1 172773 MB (4% inode=91%): /srv/swift-storage/sdf1 172348 MB (4% inode=91%): /srv/swift-storage/sdh1 165601 MB (4% inode=91%): /srv/swift-storage/sdi1 159525 MB (4% inode=91%): /srv/swift-storage/sde1 188172 MB (4% inode=92%): /srv/swift-st
[05:03:15] j1 180628 MB (4% inode=91%): /srv/swift-storage/sdk1 192416 MB (5% inode=91%): /srv/swift-storage/sdm1 179315 MB (4% inode=91%): /srv/swift-storage/sdl1 181062 MB (4% inode=91%): /srv/swift-storage/sdn1 151165 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
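A note on the thanos-be* "Disk space ... is CRITICAL" alerts above: every /srv/swift-storage mount is being reported at roughly 3-5% free. A minimal way to confirm the state by hand on an affected backend such as thanos-be1001 (the mount paths come from the alert text; the 6% cut-off below is an illustrative assumption, not the actual check threshold):

    # List free space per Swift storage mount and flag anything under an assumed 6% free.
    df --output=target,avail,size /srv/swift-storage/* | tail -n +2 |
    while read -r mount avail size; do
      pct=$(( 100 * avail / size ))
      [ "$pct" -lt 6 ] && echo "LOW: $mount ${pct}% free"
    done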
[05:10:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165915 (10phaultfinder)
[05:13:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:23:15] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 169704 MB (4% inode=91%): /srv/swift-storage/sdc1 151108 MB (3% inode=90%): /srv/swift-storage/sdf1 184532 MB (4% inode=91%): /srv/swift-storage/sdd1 169167 MB (4% inode=91%): /srv/swift-storage/sdg1 170748 MB (4% inode=91%): /srv/swift-storage/sdh1 179631 MB (4% inode=92%): /srv/swift-storage/sdi1 199305 MB (5% inode=92%): /srv/swift-st
[06:23:15] j1 178710 MB (4% inode=91%): /srv/swift-storage/sdk1 152876 MB (4% inode=91%): /srv/swift-storage/sdm1 181024 MB (4% inode=91%): /srv/swift-storage/sdn1 161828 MB (4% inode=91%): /srv/swift-storage/sdl1 163531 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[06:25:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165923 (10phaultfinder)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240922T0700)
[07:41:04] (03PS1) 10Elukey: profile::puppetserver: fix SHA1 path for labsprivate [puppet] - 10https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443)
[07:43:15] PROBLEM - Disk space on thanos-be1004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 177544 MB (4% inode=92%): /srv/swift-storage/sdc1 173496 MB (4% inode=91%): /srv/swift-storage/sdh1 186540 MB (4% inode=91%): /srv/swift-storage/sdd1 156117 MB (4% inode=90%): /srv/swift-storage/sdf1 154464 MB (4% inode=90%): /srv/swift-storage/sdg1 202921 MB (5% inode=91%): /srv/swift-storage/sdi1 151933 MB (3% inode=90%): /srv/swift-st
[07:43:15] j1 167561 MB (4% inode=92%): /srv/swift-storage/sdl1 182702 MB (4% inode=92%): /srv/swift-storage/sdk1 186828 MB (4% inode=91%): /srv/swift-storage/sdm1 181714 MB (4% inode=91%): /srv/swift-storage/sdn1 176688 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops
[07:44:02] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4073/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey)
[08:03:15] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdd1 166786 MB (4% inode=91%): /srv/swift-storage/sdc1 159213 MB (4% inode=91%): /srv/swift-storage/sdg1 152731 MB (4% inode=90%): /srv/swift-storage/sdh1 171667 MB (4% inode=91%): /srv/swift-storage/sde1 166876 MB (4% inode=91%): /srv/swift-storage/sdi1 154631 MB (4% inode=90%): /srv/swift-storage/sdj1 174284 MB (4% inode=92%): /srv/swift-st
[08:03:15] k1 194934 MB (5% inode=91%): /srv/swift-storage/sdl1 153185 MB (4% inode=91%): /srv/swift-storage/sdm1 178556 MB (4% inode=91%): /srv/swift-storage/sdn1 170089 MB (4% inode=91%): /srv/swift-storage/sdf1 152512 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[08:14:55] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[11:09:21] FIRING: CirrusSearchFullTextLatencyTooHigh: ...
[11:09:26] CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
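The exim4-base.service SystemdUnitFailed alert for mw2424 keeps re-firing through the morning (and again later in the day). A minimal triage sketch using stock systemd tooling on the host; this is generic practice, not the specific runbook linked from the alert:

    # Why is the unit marked failed, and what did it log on its last run?
    systemctl status exim4-base.service
    journalctl -u exim4-base.service --since "12 hours ago" --no-pager
    # Once the underlying cause is addressed, clear the failed state so the alert can resolve.
    sudo systemctl reset-failed exim4-base.service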
[11:11:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: ...
[11:11:26] CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[11:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:23:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:24:00] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:24:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 18.39% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:28:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:36:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: ...
[11:36:21] CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
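For the PHPFPMTooBusy flapping above (idle mw-api-ext workers hovering around the ~20% threshold), it can be quicker to watch the live value than to wait for the next FIRING/RESOLVED pair. A rough sketch against the Prometheus HTTP query API; the server URL and the metric/label names here are assumptions for illustration, the authoritative expression is whatever the alert rule defines:

    # Ad-hoc query for the idle PHP-FPM worker percentage in mw-api-ext (eqiad k8s).
    curl -sG 'http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query' \
      --data-urlencode 'query=100 * sum(phpfpm_statustext_processes{state="idle", namespace="mw-api-ext"}) / sum(phpfpm_statustext_processes{namespace="mw-api-ext"})'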
[11:39:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: ...
[11:39:21] CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[12:15:11] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:16:16] PROBLEM - Host cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[12:17:44] RECOVERY - Host cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.89 ms
[12:24:05] ^ this has recovered and seems to be T374401 but for ulsfo now
[12:24:05] T374401: Transient DOWN alert on cr2-magru - https://phabricator.wikimedia.org/T374401
[12:24:25] and again and similar to above, no recovery on victorops
[12:24:31] ah it just came ok
[12:24:40] It just resolved
[12:24:47] !incidents
[12:24:47] 5264 (RESOLVED) Host cr3-ulsfo - PING - Packet loss = 100%
[12:24:48] * sukhe goes back to Sun
[12:25:12] * jelto +1
[12:26:07] yes seems similar to that magru one
[12:29:58] or perhaps not - more like some kind of hardware fault on the router perhaps
[12:46:33] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345 (10cmooney) 03NEW p:05Triage→03High
[12:48:17] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166125 (10cmooney)
[12:49:43] Looks stable enough - we can dig more into the logs and perhaps raise a case with Juniper tomorrow
[12:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:58:49] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:00:14] PROBLEM - Host cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[13:00:45] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:01:14] RECOVERY - Host cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.93 ms
[13:01:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:03:37] ugh :(
[13:14:22] !log vgutierrez@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: cr3-ulsfo issues, no task ID specified]
[13:14:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: cr3-ulsfo issues, no task ID specified]
[13:15:22] !log depooled ulsfo due to cr3-ulsfo issues
[13:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:21] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt with reason: cr3-ulsfo fpc restart and hw instability
[13:16:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt with reason: cr3-ulsfo fpc restart and hw instability
[13:16:45] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166128 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5186e097-9b87-468b-abf6-b7a7fcd918c6) set by cmooney@cumin1002 for 1 day, 0:00:00 on 3 host(s) and their servi...
[13:18:11] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166129 (10cmooney) This is after re-occuring, approx 44 minutes after the first time, first logs mention this: ` Sep 22 12:58:28 cr3-ulsfo fpc0 listmgr_host_xtxn_idle(1526): EA[0:0].ll...
[13:21:10] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166131 (10cmooney) Logs from second time are in P69385
[14:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:43:41] PROBLEM - BFD status on cr1-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:59:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:41] RECOVERY - BFD status on cr1-esams is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:25:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:26:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:15:26] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:09] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10166164 (10phaultfinder)
[16:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[19:03:27] 06SRE, 06Trust-and-Safety, 10Wikimedia-Mailing-lists: Create a mailing list for frwiki CU - https://phabricator.wikimedia.org/T375347#10166254 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I created the mailing list under another name: https://lists.wikimedia.org/postorius/lists/wikipedia-fr-checku...
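For the cr3-ulsfo instability tracked in T375345, the on-router side of the triage is mostly reading the FPC-related logs and hardware state that cmooney quotes in the task. A sketch of the kind of standard Junos operational commands involved, run from the cr3-ulsfo CLI (the exact set used for the investigation is not recorded in this log):

    show log messages | match fpc      # the listmgr_host_xtxn_idle lines quoted in T375345
    show chassis fpc                   # FPC status, CPU and memory
    show chassis alarms                # any active hardware alarms
    show bgp summary                   # confirm peerings re-established after the flap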
[19:09:26] 06SRE, 06Trust-and-Safety, 10Wikimedia-Mailing-lists: Create a mailing list for frwiki CU - https://phabricator.wikimedia.org/T375347#10166257 (10LD) Thanks Lagsgroup. No worries about SUL/phab, I got the email. I'm gonna reach T&S for more feedback.
[19:10:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10166258 (10phaultfinder)
[20:15:26] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:42:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:43:45] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:39] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:51:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich)
[20:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[22:30:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10166285 (10phaultfinder)
[23:23:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:23:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[23:23:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:28:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[23:28:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[23:28:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
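Regarding the KubernetesDeploymentUnavailableReplicas alert on mw-jobrunner.eqiad.main just above (it resolved on its own five minutes later), a quick way to see whether replicas are genuinely unavailable and why is to ask the API server directly. A minimal sketch with stock kubectl, assuming a kubeconfig with read access to the mw-jobrunner namespace on the eqiad cluster:

    # Desired vs. ready/available replica counts
    kubectl -n mw-jobrunner get deployment mw-jobrunner.eqiad.main
    # Rollout conditions and events explaining any unavailable replicas
    kubectl -n mw-jobrunner describe deployment mw-jobrunner.eqiad.main
    # Pods that are not currently Running, if any
    kubectl -n mw-jobrunner get pods --field-selector=status.phase!=Running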
[23:38:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074730
[23:38:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074730 (owner: 10TrainBranchBot)