[00:03:00] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[01:40:15] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:44:22] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:59:04] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.58 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[02:01:48] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[02:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[02:15:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[02:20:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[02:24:56] (PS1) Samtar: InitialiseSettings: Set wgRestrictDisplayTitle = false for specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770120 (https://phabricator.wikimedia.org/T303665)
[04:11:02] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:46] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:27:46] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:48:36] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:00] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[06:29:18] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:34:22] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 40.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:37:10] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[07:55:08] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220313T0800)
[08:27:38] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:30] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:41:12] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 61 probes of 669 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:46:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 32.9 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:46:38] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.3 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:47:29] That has an accompanying rise 30 mins ago
[09:49:04] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:49:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 100.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:58:48] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[13:29:38] SRE-OnFire, DBA, Performance-Team, Wikimedia-Rdbms, and 2 others: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (Aklapper)
[13:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[14:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[14:40:57] (PS1) Ladsgroup: Change A/V player to videojs in the first batch of production wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770130 (https://phabricator.wikimedia.org/T248418)
[17:07:57] SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T303585 (Papaul) Create Dispatch: Success You have successfully submitted request SR1087069289.
[17:10:36] SRE, ops-codfw, decommission-hardware: decommission ganeti2008 - https://phabricator.wikimedia.org/T302578 (Papaul) Open→Resolved
[17:12:17] hey y'all, got a DBQueryError when trying to attach an account locally on enwiki. Went away when I retried, is this worth filing a Phab ticket over?
[17:13:36] do you have the request Id?
[17:14:16] yes, one sec
[17:14:53] full message is [2898c529-42d5-4826-83bd-2924e7f7cafd] 2022-03-13 17:06:19: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[17:14:59] I assume the bit in the brackets is the ID
[17:15:11] looking
[17:15:19] and yes, that's the reqId
[17:15:50] "Deadlock found when trying to get lock"
[17:16:51] looks like T294995
[17:16:52] T294995: Deadlocks from job setting VectorSkinVersion user preference to 1 - https://phabricator.wikimedia.org/T294995
[17:18:01] indeed, nice find
[17:20:55] reopened that one
[17:21:02] GenNotability, thanks for the report :)
[17:21:38] no problem, thanks for the quick response!
[17:27:09] oh fun, creating another account on enwiki (to ensure that vector specific config which is different on testwiki does not effect testing) worked just fine, https://en.wikipedia.org/w/index.php?title=Special:Log&logid=128863499
[17:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[18:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[18:23:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:34] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:06] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:50:32] SRE, Infrastructure-Foundations, Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (Aklapper) a:herron→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the...
[19:51:12] SRE, Release Pipeline, serviceops, Goal, Release-Engineering-Team (Seen): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (Aklapper) a:akosiaris→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See...
[19:51:50] Puppet, SRE, Infrastructure-Foundations, User-jbond: puppetmasters: update the puppet masters so they use them self for the puppet run - https://phabricator.wikimedia.org/T238093 (Aklapper) a:jbond→None Removing task assignee due to inactivity, as this open task has been assigned for more...
[19:52:10] SRE: stunnel-wrap all rsync::server usage - https://phabricator.wikimedia.org/T237424 (Aklapper) a:CDanis→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and `T295729`). Please...
[19:52:24] SRE, serviceops, User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (Aklapper) a:jijiki→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on...
[19:53:08] Puppet, SRE, Infrastructure-Foundations, User-jbond: reimage of puppet servers can fail - https://phabricator.wikimedia.org/T235067 (Aklapper) a:jbond→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task...
[19:53:14] Puppet, SRE, Infrastructure-Foundations, User-jbond: missing CRL - https://phabricator.wikimedia.org/T235185 (Aklapper) a:jbond→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 0...
[19:53:42] SRE, LDAP, User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (Aklapper) a:MoritzMuehlenhoff→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See t...
[19:53:59] SRE, Traffic-Icebox: Implement GeoDNS smooth repooling in gdnsd - https://phabricator.wikimedia.org/T228678 (Aklapper) a:BBlack→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022...
[19:55:17] SRE, serviceops, Release-Engineering-Team (Radar): Hundreds of tags for `wikimedia/mediawiki-core` image - https://phabricator.wikimedia.org/T242775 (Aklapper) a:Joe→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent...
[19:55:27] SRE, Traffic-Icebox, serviceops, Patch-Needs-Improvement: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (Aklapper) a:akosiaris→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the ema...
[19:56:07] SRE, Release Pipeline, serviceops, Kubernetes: Identify which parts of the "Add a wiki" procedure can be integrated with the deployment pipeline - https://phabricator.wikimedia.org/T238158 (Aklapper) a:akosiaris→None Removing task assignee due to inactivity, as this open task has been ass...
[19:59:46] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:01:30] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:18:42] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org
[22:12:32] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:18] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook