[00:00:04] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:52] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:14:07] SRE, Phabricator, User-Matthewrbowker: [Discussion] Phabricator has been declared EOL - https://phabricator.wikimedia.org/T283980 (Kizule) So sad. Are there any good alternatives?
[03:11:24] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:12:08] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:32:38] RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:08:22] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active - Tele2, AS1257/IPv6: Connect - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:42:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:06:19] SRE, Wikimedia-Mailing-lists: wikimedia interwiki link for mailman3 archives - https://phabricator.wikimedia.org/T283900 (Legoktm) https://meta.wikimedia.org/wiki/Talk:Interwiki_map#Mailman3
[06:22:59] SRE, SRE-Access-Requests, Patch-For-Review: Remove previous SSH key for Andrew Kostka - https://phabricator.wikimedia.org/T283940 (Marostegui) p:Triage→Medium a:Marostegui
[06:24:35] SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Marostegui) p:Triage→Medium @Ladsgroup would you mind using the template at https://phabricator.wikimedia.org/maniphest/task/edit/for...
[06:24:59] SRE, serviceops-radar, Release Pipeline (Blubber): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (Marostegui) p:Triage→Medium
[06:27:44] SRE, Wikimedia-Mailing-lists, Upstream: Strange Swedish date format in lists.wikimedia.org - https://phabricator.wikimedia.org/T283967 (Marostegui) p:Triage→Medium
[06:27:53] SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Marostegui) p:Triage→Medium
[06:28:11] SRE, Wikimedia-Mailing-lists: wikimedia interwiki link for mailman3 archives - https://phabricator.wikimedia.org/T283900 (Marostegui) p:Triage→Medium
[06:37:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:39:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:48:48] SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Legoktm) I *think* this is https://code.djangoproject.com/ticket/29826, if not exactly it's the same underlying issue, which is that hyperkitty is passing HTML to the urlizet...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210530T0700)
[07:09:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:14:36] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:22] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:46:01] SRE, Mail, User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416 (Aklapper)
[09:46:42] SRE, Mail, Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (Aklapper) Open→Resolved >>! In T116805#4955526, @mmodell wrote: > afaik the outbound email path was completely redone since this task was filed. Closing this t...
[09:50:22] SRE, Mail, Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (Legoktm) For reference, the DKIM result for the comment resolving this task is a pass: ` Authentication-Results: mx1.riseup.net; dkim=pass (1024-bit key; unprotected)...
[11:09:24] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:36] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:19:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:42:21] (Abandoned) Luca Mauri: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: Luca Mauri)
[16:21:36] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:07:08] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:26] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:40:42] SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Ladsgroup) Sure, but for an existing membership, is it really needed? Like T223137?
[18:42:09] SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Ladsgroup)
[19:30:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:32:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:17:04] PROBLEM - Host db2100 is DOWN: PING CRITICAL - Packet loss = 100%
[20:18:18] RECOVERY - Host db2100 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[20:18:30] PROBLEM - MariaDB Replica SQL: s8 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:20:08] PROBLEM - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:22:26] PROBLEM - MariaDB read only s8 on db2100 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[20:23:14] PROBLEM - mysqld processes on db2100 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[20:23:34] PROBLEM - MariaDB read only s7 on db2100 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[20:23:50] PROBLEM - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:23:50] PROBLEM - MariaDB Replica SQL: s7 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:28:05] Does ^ need a task!
[20:28:06] ?
[20:34:54] I'll create one in case
[20:35:08] PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:37:20] SRE, DBA: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (RhinosF1)
[21:07:16] restarts are manual, but that it rebooted by itself (?) does deserve a task indeed, if someone wasn't working on it
[21:17:34] apergos: my guess is for some reason it rebooted and it left mysql in a funny state
[21:19:35] a manual restart of mysql might be just fine, but that's not for me to do at midnight, for sure
[21:19:51] not being a dba....
[21:21:13] I'd expect it needs someone to do some magic
[21:21:28] Probably clean up whatever caused it or check the state
[21:21:41] But that's just from being nosey on phab
[21:22:12] But yeah, best to let DBAs look