[00:02:43] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:04:43] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:04:55] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:26] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074548 (owner: 10TrainBranchBot) [00:09:58] (03PS1) 10Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) [00:10:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [00:25:02] (03PS1) 10Dwisehaupt: icinga: add cert expiration monitoring for apis [puppet] - 10https://gerrit.wikimedia.org/r/1074552 (https://phabricator.wikimedia.org/T348725) [00:26:24] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375314 (10phaultfinder) 03NEW [00:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:11] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_file_1716985503[22](2024-09-17T17:00:15.196Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [02:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:04:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:55] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:14:55] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:01:48] (03CR) 10Ebrahim: "This doesn't have any effect on English Wikisource or multilingual Wikisource among many other Wikisources because they uses different nam" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [12:15:10] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165637 (10phaultfinder) [12:54:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:14:08] (03CR) 10Gergő Tisza: [C:03+1] "Maybe worth writing out the beta and prod domain names directly - I think the tests don't have access to the globals." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:25:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165662 (10phaultfinder) [13:41:06] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10165664 (10dcaro) Not yet, I'm still draining the C8 rack, early next week I'll have something [14:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [15:16:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [15:40:05] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165707 (10phaultfinder) [16:04:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:04:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:22:24] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328 (10phaultfinder) 03NEW [18:35:10] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10165771 (10phaultfinder) [20:56:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:05:10] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:23:15] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 203356 MB (5% inode=91%): /srv/swift-storage/sdg1 198047 MB (5% inode=92%): /srv/swift-storage/sdc1 150279 MB (3% inode=90%): /srv/swift-storage/sdh1 201548 MB (5% inode=92%): /srv/swift-storage/sde1 196895 MB (5% inode=91%): /srv/swift-storage/sdd1 177584 MB (4% inode=92%): /srv/swift-storage/sdj1 208324 MB (5% inode=91%): /srv/swift-st [21:23:15] k1 166061 MB (4% inode=91%): /srv/swift-storage/sdi1 216362 MB (5% inode=92%): /srv/swift-storage/sdl1 208991 MB (5% inode=91%): /srv/swift-storage/sdn1 228699 MB (5% inode=91%): /srv/swift-storage/sdm1 211876 MB (5% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [23:03:15] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 188777 MB (4% inode=92%): /srv/swift-storage/sdc1 198935 MB (5% inode=91%): /srv/swift-storage/sdf1 216534 MB (5% inode=92%): /srv/swift-storage/sdg1 202661 MB (5% inode=91%): /srv/swift-storage/sdd1 179522 MB (4% inode=91%): /srv/swift-storage/sde1 206070 MB (5% inode=92%): /srv/swift-storage/sdi1 188520 MB (4% inode=91%): /srv/swift-st [23:03:15] k1 179650 MB (4% inode=91%): /srv/swift-storage/sdj1 183709 MB (4% inode=91%): /srv/swift-storage/sdl1 171076 MB (4% inode=91%): /srv/swift-storage/sdm1 199820 MB (5% inode=91%): /srv/swift-storage/sdn1 151388 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [23:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074707 [23:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074707 (owner: 10TrainBranchBot)