[00:00:01] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:32] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:06] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:08] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica SQL: s7 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:16] ACKNOWLEDGEMENT - MariaDB Replica SQL: s8 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:17] ACKNOWLEDGEMENT - MariaDB read only s7 on db2100 is CRITICAL: Could not connect to localhost:3317 Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:54:17] ACKNOWLEDGEMENT - MariaDB read only s8 on db2100 is CRITICAL: Could not connect to localhost:3318 Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:54:18] ACKNOWLEDGEMENT - mysqld processes on db2100 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui Known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:13:51] (CR) Marostegui: "Can I get a sanity check on this patch please?" [puppet] - https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648) (owner: Marostegui)
[05:18:06] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 126 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:19:06] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:17] ^ Ack, will restart in a bit
[06:10:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:14:28] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:22:40] !log sudo systemctl restart mailman3 on lists1001, bounce runner crashed
[06:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:24] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:30:40] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:32:21] SRE, Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (Legoktm) Now crashing with: ` May 31 06:26:05 lists1001 mailman3[31349]: File "/usr/lib/python3/dist-packages/mailman/core/runner.py", line 134, in run May 31 06:26:05 lists1001 mail...
[06:33:31] !log manually unsubscribed ahalfaker [at] wikimedia.org from scoring-internal list, triggering mailman bounce loop T282348#7124014
[06:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:36] T282348: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348
[06:34:18] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:38:18] the mailman queue is going to be enormous right now; I'm not sure what triggered it, but we've just unsubscribed some 7k hard-bouncing addresses
[06:41:36] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:26] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:44] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:18] # the header/footer decorator. XXX 2012-03-05 this is probably
[06:51:18] # highly inefficient on the database.
[06:51:20] wow, you don't say
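The manual unsubscription !logged at 06:33:31 is only described at a high level; a minimal sketch of how it could be done from Mailman 3's `mailman shell` on lists1001 follows. The list address, member address, and API calls are assumptions based on Mailman 3's public IListManager/IRoster/IMember interfaces, not a record of the exact commands that were run:

    # Inside `mailman shell` (run as the list user on lists1001); commit() is
    # provided by the shell session itself. Addresses below are illustrative.
    from zope.component import getUtility
    from mailman.interfaces.listmanager import IListManager

    mlist = getUtility(IListManager).get('scoring-internal@lists.wikimedia.org')
    member = mlist.members.get_member('ahalfaker@wikimedia.org')
    if member is not None:
        member.unsubscribe()   # remove the subscriber that was triggering the bounce loop
    commit()                   # persist the change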
[06:52:34] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:56] I restarted it once more, not going to do it again tonight
[06:53:18] it's possible list mail delivery will be delayed
[06:59:54] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210531T0700)
[07:23:01] !log deleting all outgoing list mail that has a subject that starts with "You have been unsubscribed from the" T284003
[07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:06] T284003: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003
[07:25:37] that cut the queue in half
[07:30:45] !log deleted all outgoing list mail that is for a yahoo/aol address being unsubscribed T284003
[07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:49] T284003: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003
[07:31:04] I think the rest is gmail
[07:31:58] hm, that was only like 100 emails
[07:32:04] !log deleted all outgoing list mail that is for a gmail address being unsubscribed T284003
[07:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:20] the rest is all like random domains
[07:36:24] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:43:44] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:45:41] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 928 (limit: 25) Legoktm T284003 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:45:51] ACKNOWLEDGEMENT - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner Legoktm T284003 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:37] good night
[08:55:19] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (Cervisiarius) Hi there, After years of inactivity, I just wanted to log onto the WMF cluster again (via `ssh stat1006.eqiad.wmnet`), but encountered...
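The queue cleanup !logged between 07:23 and 07:32 is likewise only summarized; a rough sketch of what such a pass over the outgoing queue could look like is below, assuming Mailman 3's usual switchboard layout (one .pck file per queued message, with the message object pickled first) and a queue directory such as /var/lib/mailman3/queue/out. The path and filter are illustrative, the script assumes the mailman package is importable so the pickled classes resolve, and it is not the exact procedure that was used:

    # prune_out_queue.py - hypothetical cleanup of unwanted outgoing queue entries
    import pickle
    from pathlib import Path

    QUEUE_DIR = Path('/var/lib/mailman3/queue/out')   # assumed queue location
    PREFIX = 'You have been unsubscribed from the'    # per the 07:23 !log entry

    for pck in QUEUE_DIR.glob('*.pck'):
        with pck.open('rb') as fp:
            msg = pickle.load(fp)   # the queued email message
            # a second pickle.load(fp) would yield the switchboard metadata
        if str(msg.get('subject', '')).startswith(PREFIX):
            pck.unlink()            # drop the queued notification

The later passes at 07:30 and 07:32 would be the same idea with the filter applied to the recipient address (available in the pickled switchboard metadata) rather than the subject, and `mailman qfile` can be used to inspect an individual .pck file before removing it.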
[09:07:05] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (RhinosF1) Hi, You can find standard SSH config on https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_SSH_config
[09:13:00] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:47] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (Cervisiarius) Thanks, this helped! I realized that I was using the wrong SSH key, and now, when using a different one, it works.
[09:39:26] SRE, Research, SRE-Access-Requests, Patch-For-Review: Request public key change for a research fellow - https://phabricator.wikimedia.org/T177889 (RhinosF1) Glad you could fix it!
[10:01:55] (PS1) Jcrespo: mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995)
[10:02:24] (PS2) Jcrespo: mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995)
[10:04:18] (CR) RhinosF1: [C: +1] mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[10:05:09] (CR) Jcrespo: [C: +2] mariadb: Disable notifications on db2100 to handle its crash [puppet] - https://gerrit.wikimedia.org/r/697403 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[10:06:40] Ty jynus
[10:19:48] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:52:36] (PS1) Urbanecm: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460)
[10:52:56] (PS2) Urbanecm: Revert "enwiktionary: Raise AF emergency disable treshold+count" [mediawiki-config] - https://gerrit.wikimedia.org/r/697119 (https://phabricator.wikimedia.org/T283460)
[11:36:32] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:37:16] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:00:04] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:53] SRE, Wikimedia-Mailing-lists: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003 (Ladsgroup) It has somewhat recovered but still really high https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=1622435134965&to=1622472934965
[15:26:05] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (jcrespo)
[15:38:44] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:40:40] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:41:28] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.496e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:42:28] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:53:21] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) a: jcrespo→Papaul Data has been recovered, I am generating a new backup now. This should be still under warranty- most likely cause (se...
[16:29:18] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:39:20] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:46:54] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:58:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:00:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:26:30] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:32] SRE, ops-codfw, Data-Persistence (Consultation), serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (LSobanski)
[20:43:26] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:32] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-05-28 20:58:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:22:55] ops-codfw, DBA, Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (RhinosF1) Backup freshness alert for s8 went off a few moments ago. Not sure if ack/downtime worth it.
[21:49:18] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2021-05-31 20:20:13 (1252 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:43:48] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active - Tele2, AS1257/IPv6: Active - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:52:52] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:54:40] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:56:43] network maintenance?