[00:00:04] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:52] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:14:07] <wikibugs>	 SRE, Phabricator, User-Matthewrbowker: [Discussion] Phabricator has been declared EOL - https://phabricator.wikimedia.org/T283980 (Kizule) So sad. Are there any good alternatives?
[03:11:24] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:12:08] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:32:38] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:08:22] <icinga-wm>	 PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv4: Active - Tele2, AS1257/IPv6: Connect - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:42:02] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:06:19] <wikibugs>	 SRE, Wikimedia-Mailing-lists: wikimedia interwiki link for mailman3 archives - https://phabricator.wikimedia.org/T283900 (Legoktm) https://meta.wikimedia.org/wiki/Talk:Interwiki_map#Mailman3
[06:22:59] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Remove previous SSH key for Andrew Kostka - https://phabricator.wikimedia.org/T283940 (Marostegui) p:Triage→Medium a:Marostegui
[06:24:35] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Marostegui) p:Triage→Medium @Ladsgroup would you mind using the template at https://phabricator.wikimedia.org/maniphest/task/edit/for...
[06:24:59] <wikibugs>	 SRE, serviceops-radar, Release Pipeline (Blubber): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (Marostegui) p:Triage→Medium
[06:27:44] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Strange Swedish date format in lists.wikimedia.org - https://phabricator.wikimedia.org/T283967 (Marostegui) p:Triage→Medium
[06:27:53] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Marostegui) p:Triage→Medium
[06:28:11] <wikibugs>	 SRE, Wikimedia-Mailing-lists: wikimedia interwiki link for mailman3 archives - https://phabricator.wikimedia.org/T283900 (Marostegui) p:Triage→Medium
[06:37:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:39:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:48:48] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Legoktm) I *think* this is https://code.djangoproject.com/ticket/29826, if not exactly it's the same underlying issue, which is that hyperkitty is passing HTML to the urlizet...
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210530T0700)
[07:09:08] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:14:36] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:22] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:46:01] <wikibugs>	 SRE, Mail, User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416 (Aklapper)
[09:46:42] <wikibugs>	 SRE, Mail, Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (Aklapper) Open→Resolved >>! In T116805#4955526, @mmodell wrote: > afaik the outbound email path was completely redone since this task was filed.  Closing this t...
[09:50:22] <wikibugs>	 SRE, Mail, Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (Legoktm) For reference, the DKIM result for the comment resolving this task is a pass: ` Authentication-Results: mx1.riseup.net;  dkim=pass (1024-bit key; unprotected)...
[11:09:24] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:36] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:19:20] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:42:21] <wikibugs>	 (Abandoned) Luca Mauri: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: Luca Mauri)
[16:21:36] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:07:08] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:26] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:40:42] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Ladsgroup) Sure, but for an existing membership, is it really needed? Like T223137?
[18:42:09] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (Ladsgroup)
[19:30:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:32:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:17:04] <icinga-wm>	 PROBLEM - Host db2100 is DOWN: PING CRITICAL - Packet loss = 100%
[20:18:18] <icinga-wm>	 RECOVERY - Host db2100 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[20:18:30] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s8 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:20:08] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:22:26] <icinga-wm>	 PROBLEM - MariaDB read only s8 on db2100 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[20:23:14] <icinga-wm>	 PROBLEM - mysqld processes on db2100 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[20:23:34] <icinga-wm>	 PROBLEM - MariaDB read only s7 on db2100 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[20:23:50] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:23:50] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 on db2100 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:28:05] <RhinosF1>	 Does ^ need a task?
[20:34:54] <RhinosF1>	 I'll create one in case
[20:35:08] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:37:20] <wikibugs>	 SRE, DBA: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (RhinosF1)
[21:07:16] <apergos>	 restarts are manual, but that it rebooted by itself (?) does deserve a task indeed, if someone wasn't working on it
[21:17:34] <RhinosF1>	 apergos: my guess is for some reason it rebooted and it left mysql in a funny state
[21:19:35] <apergos>	 a manual restart of mysql might be just fine, but that's not for me to do at midnight, for sure
[21:19:51] <apergos>	 not being a dba....
[21:21:13] <RhinosF1>	 I'd expect it needs someone to do some magic
[21:21:28] <RhinosF1>	 Probably clean up whatever caused it or check the state
[21:21:41] <RhinosF1>	 But that's just from being nosey on phab
[21:22:12] <RhinosF1>	 But yeah, best to let DBAs look