[00:00:01] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:32] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:06] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:08] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica SQL: s7 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica SQL: s8 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:17] ACKNOWLEDGEMENT - MariaDB read only s7 on db2100 is CRITICAL: Could not connect to localhost:3317 Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:54:17] ACKNOWLEDGEMENT - MariaDB read only s8 on db2100 is CRITICAL: Could not connect to localhost:3318 Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:54:18] ACKNOWLEDGEMENT - mysqld processes on db2100 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:13:51] (CR) Marostegui: "Can I get a sanity check on this patch please?" [puppet] - https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648) (owner: Marostegui)
[05:18:06] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 126 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:19:06] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:17] ^ Ack, will restart in a bit
[06:10:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:14:28] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:22:40] !log sudo systemctl restart mailman3 on lists1001, bounce runner crashed
[06:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:24] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:30:40] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:32:21] SRE, Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (Legoktm) Now crashing with: ` May 31 06:26:05 lists1001 mailman3[31349]: File "/usr/lib/python3/dist-packages/mailman/core/runner.py", line 134, in run May 31 06:26:05 lists1001 mail...
[06:33:31] !log manually unsubscribed ahalfaker [at] wikimedia.org from scoring-internal list, triggering mailman bounce loop T282348#7124014
[06:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:36] T282348: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348
[06:34:18] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:38:18] the mailman queue is going to be enormous right now; I'm not sure what triggered it, but we've just unsubscribed some 7k hard-bouncing addresses
[06:41:36] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:26] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:44] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:18] # the header/footer decorator. XXX 2012-03-05 this is probably
[06:51:18] # highly inefficient on the database.
[06:51:20] wow, you don't say
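The manual unsubscription !logged at 06:33:31 is only described at a high level; a minimal sketch of how it could be done from Mailman 3's `mailman shell` on lists1001 follows. The list address, member address, and API calls are assumptions based on Mailman 3's public IListManager/IRoster/IMember interfaces, not a record of the exact commands that were run:

    # Inside `mailman shell` (run as the list user on lists1001); commit() is
    # provided by the shell session itself. Addresses below are illustrative.
    from zope.component import getUtility
    from mailman.interfaces.listmanager import IListManager

    mlist = getUtility(IListManager).get('scoring-internal@lists.wikimedia.org')
    member = mlist.members.get_member('ahalfaker@wikimedia.org')
    if member is not None:
        member.unsubscribe()   # remove the subscriber that was triggering the bounce loop
    commit()                   # persist the change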
[06:52:34] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:56] I restarted it once more, not going to do it again tonight
[06:53:18] it's possible list mail delivery will be delayed
[06:59:54] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210531T0700)
[07:23:01] !log deleting all outgoing list mail that has a subject that starts with "You have been unsubscribed from the" T284003
[07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:06] T284003: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003
[07:25:37] that cut the queue in half
[07:30:45] !log deleted all outgoing list mail that is for a yahoo/aol address being unsubscribed T284003
[07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:49] T284003: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003
[07:31:04] I think the rest is gmail
[07:31:58] hm, that was only like 100 emails
[07:32:04] !log deleted all outgoing list mail that is for a gmail address being unsubscribed T284003
[07:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] the rest is all like random domains
[07:36:24] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:43:44] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:45:41] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 928 (limit: 25) Legoktm T284003 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:45:51] ACKNOWLEDGEMENT - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner Legoktm T284003 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:37] good night
[08:55:19] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (Cervisiarius) Hi there, After years of inactivity, I just wanted to log onto the WMF cluster again (via `ssh stat1006.eqiad.wmnet`), but encountered...
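The queue cleanup !logged between 07:23 and 07:32 is likewise only summarized; a rough sketch of what such a pass over the outgoing queue could look like is below, assuming Mailman 3's usual switchboard layout (one .pck file per queued message, with the message object pickled first) and a queue directory such as /var/lib/mailman3/queue/out. The path and filter are illustrative, the script assumes the mailman package is importable so the pickled classes resolve, and it is not the exact procedure that was used:

    # prune_out_queue.py - hypothetical cleanup of unwanted outgoing queue entries
    import pickle
    from pathlib import Path

    QUEUE_DIR = Path('/var/lib/mailman3/queue/out')   # assumed queue location
    PREFIX = 'You have been unsubscribed from the'    # per the 07:23 !log entry

    for pck in QUEUE_DIR.glob('*.pck'):
        with pck.open('rb') as fp:
            msg = pickle.load(fp)   # the queued email message
            # a second pickle.load(fp) would yield the switchboard metadata
        if str(msg.get('subject', '')).startswith(PREFIX):
            pck.unlink()            # drop the queued notification

The later passes at 07:30 and 07:32 would be the same idea with the filter applied to the recipient address (available in the pickled switchboard metadata) rather than the subject, and `mailman qfile` can be used to inspect an individual .pck file before removing it.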
[09:07:05] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (RhinosF1) Hi, You can find standard SSH config on https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_SSH_config
[09:13:00] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:47] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (Cervisiarius) Thanks, this helped! I realized that I was using the wrong SSH key, and now, when using a different one, it works.
[09:39:26] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (RhinosF1) Glad you could fix it!
[10:01:55] (PS1) Jcrespo: mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995)
[10:02:24] (PS2) Jcrespo: mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995)
[10:04:18] (CR) RhinosF1: [C: +1] mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[10:05:09] (CR) Jcrespo: [C: +2] mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[10:06:40] Ty jynus
[10:19:48] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:52:36] (PS1) Urbanecm: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460)
[10:52:56] (PS2) Urbanecm: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460)
[11:36:32] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:37:16] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:00:04] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:53] SRE, Wikimedia-Mailing-lists: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003 (Ladsgroup) It has somewhat recovered but still really high https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=1622435134965&to=1622472934965
[15:26:05] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (jcrespo)
[15:38:44] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:40:40] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:41:28] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.496e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:42:28] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:53:21] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) a: jcrespo→Papaul Data has been recovered, I am generating a new backup now. This should be still under warranty- most likely cause (se...
[16:29:18] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:20] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:46:54] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:58:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:00:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:26:30] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:32] SRE, ops-codfw, Data-Persistence (Consultation), serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (LSobanski)
[20:43:26] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:32] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-05-28 20:58:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:22:55] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (RhinosF1) Backup freshness alert for s8 went off a few moments ago. Not sure if ack/downtime worth it.
[21:49:18] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2021-05-31 20:20:13 (1252 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:43:48] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active - Tele2, AS1257/IPv6: Active - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:52:52] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:40] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:56:43] network maintenance?