[00:00:39] JJMC89: is it gone?
[00:01:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048591 (owner: TrainBranchBot)
[00:02:00] yes - I saw lag recovery in line with the alert above
[00:11:18] (CR) Cwhite: [C: +2] logstash: move thumbor logs to logstash-thumbor partition [puppet] - https://gerrit.wikimedia.org/r/1048592 (https://phabricator.wikimedia.org/T368180) (owner: Cwhite)
[00:18:36] JJMC89: ok, good. thanks
[00:59:29] (PS1) Cwhite: logstash: reduce thumbor replicas [puppet] - https://gerrit.wikimedia.org/r/1048605 (https://phabricator.wikimedia.org/T368180)
[01:00:27] (CR) Cwhite: [C: +2] logstash: reduce thumbor replicas [puppet] - https://gerrit.wikimedia.org/r/1048605 (https://phabricator.wikimedia.org/T368180) (owner: Cwhite)
[01:04:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T364069)', diff saved to https://phabricator.wikimedia.org/P65335 and previous config saved to /var/cache/conftool/dbconfig/20240622-010436-marostegui.json
[01:04:42] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[01:19:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P65336 and previous config saved to /var/cache/conftool/dbconfig/20240622-011943-marostegui.json
[01:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P65337 and previous config saved to /var/cache/conftool/dbconfig/20240622-013451-marostegui.json
[01:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T364069)', diff saved to https://phabricator.wikimedia.org/P65338 and previous config saved to /var/cache/conftool/dbconfig/20240622-014958-marostegui.json
[01:50:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:50:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[01:50:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:50:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65339 and previous config saved to /var/cache/conftool/dbconfig/20240622-015020-marostegui.json
[02:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:49:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:15] PROBLEM - mysqld processes on db2197 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:45:19] PROBLEM - MariaDB Replica IO: x1 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:19] PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:19] PROBLEM - MariaDB Replica IO: s2 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:19] PROBLEM - MariaDB Replica IO: s6 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:19] PROBLEM - MariaDB Replica SQL: s6 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:20] PROBLEM - MariaDB Replica SQL: x1 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:45:27] PROBLEM - MariaDB read only x1 on db2197 is CRITICAL: Could not connect to localhost:3320 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:45:27] PROBLEM - MariaDB read only s6 on db2197 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:45:27] PROBLEM - MariaDB read only s2 on db2197 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:01:09] what?
[05:01:14] That is a backup host, let me check
[05:01:33] it rebooted itself looks like
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica IO: s2 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica SQL: s6 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB Replica SQL: x1 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:04:58] ACKNOWLEDGEMENT - MariaDB read only s2 on db2197 is CRITICAL: Could not connect to localhost:3312 Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:04:59] ACKNOWLEDGEMENT - MariaDB read only s6 on db2197 is CRITICAL: Could not connect to localhost:3316 Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:04:59] ACKNOWLEDGEMENT - MariaDB read only x1 on db2197 is CRITICAL: Could not connect to localhost:3320 Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:05:00] ACKNOWLEDGEMENT - mysqld processes on db2197 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui https://phabricator.wikimedia.org/T368189 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:05:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2197.codfw.wmnet with reason: Long schema change
[05:05:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2197.codfw.wmnet with reason: Long schema change
[06:02:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65340 and previous config saved to /var/cache/conftool/dbconfig/20240622-060216-marostegui.json
[06:02:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P65341 and previous config saved to /var/cache/conftool/dbconfig/20240622-061725-marostegui.json
[06:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P65342 and previous config saved to /var/cache/conftool/dbconfig/20240622-063232-marostegui.json
[06:47:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T364069)', diff saved to https://phabricator.wikimedia.org/P65343 and previous config saved to /var/cache/conftool/dbconfig/20240622-064739-marostegui.json
[06:47:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[06:47:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:47:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[06:48:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65344 and previous config saved to /var/cache/conftool/dbconfig/20240622-064802-marostegui.json
[07:13:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:41] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:15:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:31] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:26:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:31] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:14:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:14:28] (CR) AOkoth: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[08:14:31] (CR) AOkoth: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[08:42:31] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:36:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 102 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:46:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 13 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:50:11] (PS78) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822)
[10:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:06:54] (CR) Zabe: [ltwiki] Add a new 'rollbacker' usergroup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) (owner: Superpes15)
[11:08:58] (PS2) Superpes15: [ltwiki] Add a new 'rollbacker' usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993)
[11:09:59] (CR) Superpes15: [ltwiki] Add a new 'rollbacker' usergroup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) (owner: Superpes15)
[11:18:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65345 and previous config saved to /var/cache/conftool/dbconfig/20240622-111842-marostegui.json
[11:18:49] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:33:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P65346 and previous config saved to /var/cache/conftool/dbconfig/20240622-113350-marostegui.json
[11:41:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:46:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:48:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P65347 and previous config saved to /var/cache/conftool/dbconfig/20240622-114857-marostegui.json
[11:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:03:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65348 and previous config saved to /var/cache/conftool/dbconfig/20240622-120404-marostegui.json
[12:04:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[12:04:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[12:04:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[12:04:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65349 and previous config saved to /var/cache/conftool/dbconfig/20240622-120437-marostegui.json
[12:06:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:32:45] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:48:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:52:45] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:53:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[12:53:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:55:31] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:56:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:58:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:02:31] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:03:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:06:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 429.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:53:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:54:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:08:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:10:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:19:27] Hello. I'm unable to load Wikipedia as of this morning. My connection keeps timing out
[15:19:42] I was wondering if any ops can comment.
[15:20:03] Rest of the internet works fine.
[15:29:56] !help ^
[15:29:56] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*"
[15:29:56] You're not allowed to perform this action.
[15:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65350 and previous config saved to /var/cache/conftool/dbconfig/20240622-153318-marostegui.json
[15:33:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:48:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P65351 and previous config saved to /var/cache/conftool/dbconfig/20240622-154826-marostegui.json
[15:50:39] Cyberpower678: I suspect there may still be something locally hampering your connectivity. What about a different network, etc?
[15:50:58] My mobile network works.
[15:51:16] Nothing on my network is indicating that it's blocking the connection though.
[15:51:36] It's like Wikimedia servers are dropping my connection
[15:52:11] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:17] brett ^
[15:52:29] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:37] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.25 ms
[15:52:45] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.64 ms
[15:56:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 813858944 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:59:13] Whelp, traceroute is showing that it's leaving my network and then stopping after the 5th hop. :/
[15:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:03:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P65352 and previous config saved to /var/cache/conftool/dbconfig/20240622-160333-marostegui.json
[16:04:54] brett, think you can help investigate? My tracerouteV6 indicates it's able to trace all the way to en.wikipedia.org where it gets a 100% packet loss on Wikipedia.
[16:08:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[16:18:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65353 and previous config saved to /var/cache/conftool/dbconfig/20240622-161841-marostegui.json
[16:18:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[16:18:48] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[16:18:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[16:36:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:45:00] Cyberpower678: Unfortunately, I'm pretty full right now and I don't think I'd be particularly useful
[16:45:36] Yea, I'm poking around. But right now it's looking less like a local network issue.
[17:06:40] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1446:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 48.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:54:25] (PS4) Gergő Tisza: [beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162)
[17:57:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:57:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:59:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:00:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52196 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:00:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:00:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.882 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:02:31] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:02:59] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms
[18:05:48] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:08:55] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:29] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.76 ms
[18:41:46] (CR) Gergő Tisza: [POC] Handle sso.wikimedia.org domain (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[18:42:02] (PS16) Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[18:42:02] (PS1) Gergő Tisza: [noop] Remove $wgRedirectScript, not used since MediaWiki 1.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1048855
[18:57:02] (PS17) Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[19:13:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:22:59] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:23:59] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms
[19:28:11] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:05] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:15] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.64 ms
[19:29:19] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.41 ms
[19:33:01] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:34:01] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms
[19:53:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[20:28:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[20:53:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:06:40] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1446:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:44] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368209 (phaultfinder) NEW
[21:22:45] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210 (phaultfinder) NEW
[21:22:46] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368208 (phaultfinder) NEW
[21:27:47] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9915424 (phaultfinder)
[21:27:48] ops-codfw, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9915425 (phaultfinder)
[21:32:47] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T368210#9915435 (phaultfinder)
[21:43:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:58:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:03:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:05:48] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:19:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 9.622s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:19:57] FIRING: [7x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:20:13] PROBLEM - MariaDB Replica Lag: s4 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:20:29] PROBLEM - MariaDB Replica Lag: s4 on db2179 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:38] PROBLEM - MariaDB Replica Lag: s4 #page on db1199 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:38] PROBLEM - MariaDB Replica Lag: s4 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:38] PROBLEM - MariaDB Replica Lag: s4 #page on db1190 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:39] PROBLEM - MariaDB Replica Lag: s4 on db2199 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:42] PROBLEM - MariaDB Replica Lag: s4 #page on db1160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:20:45] PROBLEM - MariaDB Replica Lag: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:46] PROBLEM - MariaDB Replica Lag: s4 #page on db1248 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:47] PROBLEM - MariaDB Replica Lag: s4 #page on db1221 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:48] PROBLEM - MariaDB Replica Lag: s4 #page on db1244 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:48] FIRING: [3x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:20:49] PROBLEM - MariaDB Replica Lag: s4 #page on db1242 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:50] PROBLEM - MariaDB Replica Lag: s4 #page on db1249 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:50] PROBLEM - MariaDB Replica Lag: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:51] PROBLEM - MariaDB Replica Lag: s4 #page on db1243 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:52] PROBLEM - MariaDB Replica Lag: s4 #page on db1241 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:21:00] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:21:00] PROBLEM - MariaDB Replica Lag: s4 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:21:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:23:11] RECOVERY - MariaDB Replica Lag: s4 #page on db1160 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:23:15] RECOVERY - MariaDB Replica Lag: s4 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:23:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 7.531% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:23:42] RECOVERY - MariaDB Replica Lag: s4 #page on db1248 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:23:48] FIRING: [20x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:24:13] RECOVERY - MariaDB Replica Lag: s4 #page on db1221 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:24:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 4.865s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:24:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 4.865s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:24:43] RECOVERY - MariaDB Replica Lag: s4 #page on db1249 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:24:57] FIRING: [18x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:03] FIRING: [18x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:09] RECOVERY - MariaDB Replica Lag: s4 #page on db1247 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:25:22] RECOVERY - MariaDB Replica Lag: s4 #page on db1241 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:23] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:25:23] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:25:32] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:33] FIRING: [3x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:34] RECOVERY - MariaDB Replica Lag: s4 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:37] FIRING: [5x] ProbeDown: Service miscweb1003:30443 has failed probes (http_annual_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:46] RECOVERY - MariaDB Replica Lag: s4 on db1155 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:47] FIRING: [5x] ProbeDown: Service miscweb1003:30443 has failed probes (http_annual_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:52] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:25:57] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:26:00] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:00] RECOVERY - MariaDB Replica Lag: s4 on db2179 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:04] RECOVERY - MariaDB Replica Lag: s4 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:19] RECOVERY - MariaDB Replica Lag: s4 #page on db1199 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:27] RECOVERY - MariaDB Replica Lag: s4 #page on db1190 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:27] RECOVERY - MariaDB Replica Lag: s4 on db2199 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:44] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[22:26:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:26:46] RECOVERY - MariaDB Replica Lag: s4 #page on db1243 is OK: OK slave_sql_lag Replication lag: 1.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:26:51] FIRING: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:26:56] RECOVERY - MariaDB Replica Lag: s4 #page on db1242 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:27:37] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[22:27:59] RECOVERY - MariaDB Replica Lag: s4 #page on db1244 is OK: OK slave_sql_lag Replication lag: 0.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:28:03] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 472 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1149, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 466, delayed_unassigned_shards: 0, number_of_pending
[22:28:03] 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 201, active_shards_percent_as_number: 70.88217149907464 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:05] PROBLEM - OpenSearch health check for shards on 9200 on logstash1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 701 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 725, active_shards: 1027, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 693
[22:28:05] d_unassigned_shards: 0, number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 257591, active_shards_percent_as_number: 59.432870370370374 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:06] PROBLEM - OpenSearch health check for shards on 9200 on logstash1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 701 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 725, active_shards: 1027, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 693
[22:28:06] d_unassigned_shards: 0, number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 257623, active_shards_percent_as_number: 59.432870370370374 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:07] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 466 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1155, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 463, delayed_unassigned_shards: 0, number_of_pending
[22:28:07] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83, active_shards_percent_as_number: 71.25231338679828 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:11] PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch inactive shards 683 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 730, active_shards: 1045, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 679
[22:28:11] d_unassigned_shards: 0, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 262038, active_shards_percent_as_number: 60.47453703703704 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:11] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1397 threshold =0.2 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, active_primary_shards: 1379, active_shards: 2794, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 1390, delayed_unassigned_shards: 0, number_of_
[22:28:11] tasks: 14, number_of_in_flight_fetch: 378, task_max_waiting_in_queue_millis: 232311, active_shards_percent_as_number: 66.66666666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:13] PROBLEM - OpenSearch health check for shards on 9200 on logstash1034 is CRITICAL: CRITICAL - elasticsearch inactive shards 673 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1055, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 66
[22:28:13] ed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 266067, active_shards_percent_as_number: 61.05324074074075 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:15] PROBLEM - Etcd cluster health on dse-k8s-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[22:28:15] PROBLEM - OpenSearch health check for shards on 9200 on logstash1033 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:15] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 446 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1175, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 442, delayed_unassigned_shards: 0, number_of_pending
[22:28:15] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.48611967921036 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:15] PROBLEM - OpenSearch health check for shards on 9200 on logstash1036 is CRITICAL: CRITICAL - elasticsearch inactive shards 673 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1055, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 66
[22:28:15] ed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 266928, active_shards_percent_as_number: 61.05324074074075 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:16] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 2.292% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:28:16] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:16] PROBLEM - OpenSearch health check for shards on 9200 on logstash1037 is CRITICAL: CRITICAL - elasticsearch inactive shards 666 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1062, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 661
[22:28:17] d_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 268164, active_shards_percent_as_number: 61.458333333333336 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:17] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 444 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1177, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 440, delayed_unassigned_shards: 0, number_of_pending
[22:28:18] 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 293, active_shards_percent_as_number: 72.60950030845157 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:18] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 442 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1179, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 440, delayed_unassigned_shards: 0, number_of_pending
[22:28:19] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 72.73288093769278 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:21] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 655 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1073, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 64
[22:28:21] ed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 274273, active_shards_percent_as_number: 62.094907407407405 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:23] PROBLEM - OpenSearch health check for shards on 9200 on logstash1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 650 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1078, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 645
[22:28:23] d_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 275428, active_shards_percent_as_number: 62.38425925925925 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:25] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:27] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 639 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1089, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 6
[22:28:27] yed_unassigned_shards: 0, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 280131, active_shards_percent_as_number: 63.020833333333336 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:27] PROBLEM - OpenSearch health check for shards on 9200 on logstash1035 is CRITICAL: CRITICAL - elasticsearch inactive shards 639 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1089, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 631
[22:28:27] d_unassigned_shards: 0, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 280152, active_shards_percent_as_number: 63.020833333333336 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:27] PROBLEM - OpenSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch inactive shards 639 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1089, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 631
[22:28:28] d_unassigned_shards: 0, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 279993, active_shards_percent_as_number: 63.020833333333336 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:29] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch inactive shards 636 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1092, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 631
[22:28:29] d_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 280830, active_shards_percent_as_number: 63.19444444444444 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:28:29] PROBLEM - OpenSearch health check for shards on 9200 on logstash1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 635 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1093, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 631
[22:28:30] d_unassigned_shards: 0,
number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1729, active_shards_percent_as_number: 63.25231481481482 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:30] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 635 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1093, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 6 [22:28:31] yed_unassigned_shards: 0, number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1934, active_shards_percent_as_number: 63.25231481481482 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:31] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:32] PROBLEM - OpenSearch health check for shards on 9200 on logstash1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 635 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1093, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 62 [22:28:32] ed_unassigned_shards: 0, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2943, active_shards_percent_as_number: 63.25231481481482 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:33] PROBLEM - OpenSearch health check for shards on 9200 on logstash1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 631 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1097, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 623 [22:28:33] d_unassigned_shards: 0, number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3017, active_shards_percent_as_number: 63.48379629629629 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:34] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 635 threshold =0.34 breach: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 17, number_of_data_nodes: 11, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1093, relocating_shards: 0, initializing_shards: 12, unassigned_shards: [22:28:34] ayed_unassigned_shards: 0, number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2534, active_shards_percent_as_number: 63.25231481481482 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:28:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:41] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search 
AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [22:28:43] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [22:28:47] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:29:04] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:29:05] RECOVERY - OpenSearch health check for shards on 9200 on logstash1028 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1196, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 519, delayed_unassigned_sha [22:29:05] number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 732, active_shards_percent_as_number: 69.17293233082707 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:05] RECOVERY - OpenSearch health check for shards on 9200 on logstash1029 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1196, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 519, delayed_unassigned_sha [22:29:05] number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 765, active_shards_percent_as_number: 69.17293233082707 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:07] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1368, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_f [22:29:07] tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.52751423149905 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:07] RECOVERY - OpenSearch health check for shards on 9200 on logstash1033 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1200, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 519, delayed_unassigned_sha [22:29:07] number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2316, 
active_shards_percent_as_number: 69.40427993059572 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:08] FIRING: [26x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:13] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, active_primary_shards: 1381, active_shards: 3490, relocating_shards: 0, initializing_shards: 103, unassigned_shards: 598, delayed_unassigned_shards: 0, number_of_pending_tasks: 103, [22:29:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 293284, active_shards_percent_as_number: 83.27368169887855 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:15] RECOVERY - OpenSearch health check for shards on 9200 on logstash1034 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1223, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 492, delayed_unassigned_sha [22:29:15] number_of_pending_tasks: 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 888, active_shards_percent_as_number: 70.73452862926547 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:15] RECOVERY - OpenSearch health check for shards on 9200 on logstash1024 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1214, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 501, delayed_unassigned_sha [22:29:15] number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 930, active_shards_percent_as_number: 70.213996529786 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:16] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1307, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 307, delayed_unassigned_shards: 0, number_of_pending_tasks: 8, number_of_ [22:29:16] t_fetch: 0, task_max_waiting_in_queue_millis: 852, active_shards_percent_as_number: 80.62924120913017 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:16] RECOVERY - OpenSearch health check for shards on 9200 on logstash1036 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1227, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 492, delayed_unassigned_sha [22:29:17] number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1243, active_shards_percent_as_number: 70.96587622903412 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:17] RECOVERY - OpenSearch health check for shards on 9200 on logstash1037 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1232, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 492, delayed_unassigned_shar [22:29:18] umber_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 71.25506072874494 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:21] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1318, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 300, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [22:29:21] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.30783466995682 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:21] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 811, active_shards: 1320, relocating_shards: 0, initializing_shards: 9, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pending_tasks: 7, number_of_ [22:29:21] t_fetch: 0, task_max_waiting_in_queue_millis: 1092, active_shards_percent_as_number: 81.43121529919803 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:25] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1257, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 468, delayed_unassigned_shar [22:29:25] umber_of_pending_tasks: 2, number_of_in_flight_fetch: 24, task_max_waiting_in_queue_millis: 85, active_shards_percent_as_number: 72.70098322729902 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:27] RECOVERY - OpenSearch health check for shards on 9200 on logstash1031 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1257, relocating_shards: 0, initializing_shards: 14, unassigned_shards: 458, delayed_unassigned_sha [22:29:27] number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 854, active_shards_percent_as_number: 72.70098322729902 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:31] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1500.0] 
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [22:29:31] RECOVERY - OpenSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1273, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 450, delayed_unassigned_shar [22:29:31] umber_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1121, active_shards_percent_as_number: 73.62637362637363 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:31] RECOVERY - OpenSearch health check for shards on 9200 on logstash1035 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1273, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 450, delayed_unassigned_shar [22:29:31] umber_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1622, active_shards_percent_as_number: 73.62637362637363 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:31] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1003 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1274, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 450, delayed_unassigned_sh [22:29:32] number_of_pending_tasks: 1, number_of_in_flight_fetch: 24, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 73.68421052631578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:35] RECOVERY - OpenSearch health check for shards on 9200 on logstash1030 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 443, delayed_unassigned_shar [22:29:35] umber_of_pending_tasks: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 527, active_shards_percent_as_number: 74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:35] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1002 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 443, delayed_unassigned_sh [22:29:35] number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 211, active_shards_percent_as_number: 
74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:35] RECOVERY - OpenSearch health check for shards on 9200 on logstash1026 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 443, delayed_unassigned_shar [22:29:35] umber_of_pending_tasks: 1, number_of_in_flight_fetch: 24, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:36] PROBLEM - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [22:29:36] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1001 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 443, delayed_unassigned_sh [22:29:37] number_of_pending_tasks: 1, number_of_in_flight_fetch: 24, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:37] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 435, delayed_unassigned_sha [22:29:38] number_of_pending_tasks: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 952, active_shards_percent_as_number: 74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:38] RECOVERY - OpenSearch health check for shards on 9200 on logstash1027 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 731, active_shards: 1281, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 435, delayed_unassigned_sha [22:29:39] number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1486, active_shards_percent_as_number: 74.08906882591093 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:29:40] FIRING: [5x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 16.69s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:29:51] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is 
CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [22:30:03] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 811, active_shards: 1408, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 212, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [22:30:03] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.85996298581122 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:30:04] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [22:30:05] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:30:07] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 811, active_shards: 1414, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 203, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_ [22:30:07] t_fetch: 0, task_max_waiting_in_queue_millis: 509, active_shards_percent_as_number: 87.23010487353486 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:30:11] FIRING: [41x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:23] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 811, active_shards: 1450, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 171, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [22:30:23] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.45095619987661 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:30:39] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:30:47] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:30:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:30:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from 
remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:30:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:31:08] FIRING: [13x] ProbeDown: Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:31:18] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:31:26] FIRING: [26x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:31:30] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 479, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:31:59] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:09] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 456 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 788, active_shards: 1125, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 450, delayed_unassigned_shards: 0, number_of_pending_task [22:32:09] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1841, active_shards_percent_as_number: 71.15749525616698 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:12] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:32:15] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 448 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 788, active_shards: 1133, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 441, delayed_unassigned_shards: 0, number_of_pending_task [22:32:15] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 679, active_shards_percent_as_number: 71.66350411132196 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:15] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 447 threshold =0.2 breach: 
cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 788, active_shards: 1134, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 440, delayed_unassigned_shards: 0, number_of_pending_task [22:32:15] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1422, active_shards_percent_as_number: 71.72675521821633 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:21] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 440 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 788, active_shards: 1141, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 433, delayed_unassigned_shards: 0, number_of_pending_task [22:32:21] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2438, active_shards_percent_as_number: 72.16951296647692 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:25] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:27] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:32:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:32:47] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:32:48] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:32:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:32:51] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:33:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:33:33] FIRING: KubernetesAPINotScrapable: k8s-dse@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:33:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 
1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:04] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:34:07] FIRING: [18x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:34:09] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1311, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 267, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_f [22:34:09] tch: 0, task_max_waiting_in_queue_millis: 290, active_shards_percent_as_number: 82.92220113851992 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:15] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1322, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 255, delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_f [22:34:15] tch: 0, task_max_waiting_in_queue_millis: 689, active_shards_percent_as_number: 83.61796331435801 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:15] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1325, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 255, delayed_unassigned_shards: 0, number_of_pending_tasks: 6, number_of_in_f [22:34:15] tch: 0, task_max_waiting_in_queue_millis: 201, active_shards_percent_as_number: 83.80771663504112 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:19] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1333, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 248, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f [22:34:19] tch: 0, task_max_waiting_in_queue_millis: 69, active_shards_percent_as_number: 84.31372549019608 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:19] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1333, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 248, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_f [22:34:19] tch: 0, 
task_max_waiting_in_queue_millis: 398, active_shards_percent_as_number: 84.31372549019608 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:19] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [22:34:21] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 791, active_shards: 1336, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 244, delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_f [22:34:21] tch: 0, task_max_waiting_in_queue_millis: 1327, active_shards_percent_as_number: 84.5034788108792 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:34:41] FIRING: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:34:46] FIRING: [5x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 24.01s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:34:58] FIRING: [38x] ProbeDown: Service dse-k8s-ctrl1002:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:11] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset -0.00027502 secs https://wikitech.wikimedia.org/wiki/NTP [22:35:47] RECOVERY - Host mwlog1002 is UP: PING WARNING - Packet loss = 90%, RTA = 10.10 ms [22:35:48] FIRING: [19x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:35:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:35:57] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:35:59] PROBLEM - MariaDB Replica Lag: m1 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 448.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:36:51] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:36:53] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [22:36:56] FIRING: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:37:03] PROBLEM - SSH on mwlog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:37:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [22:37:36] FIRING: [2x] GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [22:37:40] FIRING: [12x] KubernetesRsyslogDown: rsyslog on kubernetes1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:37:48] FIRING: [3x] Primary inbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:37:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:37:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [22:37:56] Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [22:38:13] PROBLEM - SSH on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:38:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:38:18] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:33] RESOLVED: KubernetesAPINotScrapable: k8s-dse@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:38:48] FIRING: [20x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:38:50] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:38:59] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:39:10] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 811, active_shards: 1509, relocating_shards: 2, initializing_shards: 2, unassigned_shards: 110, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_ [22:39:10] t_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.09068476249229 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:32] PROBLEM - SSH on wdqs1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:39:33] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:39:50] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [22:40:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:40:00] FIRING: [37x] ProbeDown: Service dse-k8s-ctrl1002:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:30] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [22:40:48] FIRING: [23x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:50] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:41:10] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:41:42] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:41:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [22:41:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:42:00] ( i would assume someone has been notified automatically, but since IRC states that kamila_ (SRE on duty) is "away/idle", I'm pinging just in case....sorry if disturbing) [22:42:25] 👍 [22:42:27] yes, SREs have acknowledged the pages and are working on it [22:42:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:42:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:42:48] RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [22:42:48] AntiComposite: thanks! [22:42:51] FIRING: [21x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:42:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:42:56] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:43:00] RECOVERY - SSH on mwlog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:43:06] RECOVERY - SSH on wdqs1012 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:43:20] RECOVERY - Etcd cluster health on dse-k8s-etcd1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [22:43:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52196 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:43:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:43:47] FIRING: [31x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:43:53] FIRING: [7x] 
KubernetesAPILatency: High Kubernetes API latency (LIST endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:44:00] RECOVERY - MariaDB Replica Lag: m1 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:44:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 24.06s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:44:26] RECOVERY - SSH on wdqs1018 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:44:41] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [22:44:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1149592456 and 72 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:44:58] RESOLVED: [35x] ProbeDown: Service dse-k8s-ctrl1002:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:37] RESOLVED: [11x] ProbeDown: Service miscweb1003:30443 has failed probes (http_15_wikipedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:45] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:45:48] RESOLVED: [33x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:43] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:46:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:46:51] RESOLVED: [5x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:47:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [22:47:36] FIRING: [2x] GatewayBackendErrorsHigh: rest-gateway: 
elevated 5xx errors from wikifeeds_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [22:47:48] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:47:51] FIRING: [21x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:47:52] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 137160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:48:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:48:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:53] RESOLVED: [7x] KubernetesAPILatency: High Kubernetes API latency (LIST endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:49:03] RESOLVED: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:49:15] RESOLVED: [4x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.015s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:50:15] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:50:45] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:50:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:51:02] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [22:51:40] RECOVERY - CirrusSearch comp_suggest eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [22:52:47] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [22:52:51] RESOLVED: [21x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:52:56] RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [22:52:56] Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [22:53:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:34] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [22:56:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:14:42] (03PS1) 10Cwhite: logstash: drop majority of runPrimaryTransactionIdleCallbacks warnings [puppet] - 10https://gerrit.wikimedia.org/r/1048863 [23:18:07] (03CR) 10Cwhite: [V:03+2 C:03+2] logstash: drop majority of runPrimaryTransactionIdleCallbacks warnings [puppet] - 10https://gerrit.wikimedia.org/r/1048863 (owner: 10Cwhite) [23:26:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:47] FIRING: [3x] SystemdUnitFailed: 
update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1048864 [23:38:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1048864 (owner: 10TrainBranchBot) [23:51:57] (03PS1) 10Cwhite: logstash: expand message match [puppet] - 10https://gerrit.wikimedia.org/r/1048865 [23:53:29] (03CR) 10Cwhite: [V:03+2 C:03+2] logstash: expand message match [puppet] - 10https://gerrit.wikimedia.org/r/1048865 (owner: 10Cwhite) [23:57:35] evenin', are we aware of issues with PDF thumbnail generation? I'm trying to edit Wikisource but cannot load any of my page thumbnails and have been getting quite a lot of Error 429s [23:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed