[00:04:11] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:43:19] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:00] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:22] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[02:52:09] PROBLEM - MegaRAID on es1029 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:52:10] ACKNOWLEDGEMENT - MegaRAID on es1029 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T302169 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:52:13] SRE, ops-eqiad: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (ops-monitoring-bot)
[03:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[03:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[04:02:21] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:59] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:42] (PS1) Andrew Bogott: OpenStack cinder: monitor snapshots in 'admin' project [puppet] - https://gerrit.wikimedia.org/r/763919
[04:28:55] (PS1) Andrew Bogott: Openstack Cinder: Include service monitoring [puppet] - https://gerrit.wikimedia.org/r/763920
[04:32:11] (PS2) Andrew Bogott: OpenStack cinder: monitor snapshots in 'admin' project [puppet] - https://gerrit.wikimedia.org/r/763919
[04:36:02] (PS2) Andrew Bogott: Openstack Cinder: Include service monitoring [puppet] - https://gerrit.wikimedia.org/r/763920
[04:36:04] (PS3) Andrew Bogott: OpenStack cinder: monitor snapshots in 'admin' project [puppet] - https://gerrit.wikimedia.org/r/763919
[04:36:06] (PS1) Andrew Bogott: Openstack monitoring: move fullstack instance check to python3 [puppet] - https://gerrit.wikimedia.org/r/763921
[05:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[06:20:56] SRE, ops-eqiad, Data-Persistence: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (Peachey88)
[06:23:56] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (Peachey88)
[06:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220220T0800)
[08:00:27] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 9.362e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:12:17] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:47:28] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Stang)
[09:51:07] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Stang) Also as someone else [[https://vrt-wiki.wikimedia.org/w/index.php?oldid=114912#Email_don't_received|reported]].
[10:13:33] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:21:09] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.068e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[12:02:49] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 35.31 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:05:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:27:49] !log taavi@deploy1002 Synchronized private/PrivateSettings.php: T302047 (duration: 00m 49s)
[12:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:52] (CR) Majavah: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/762535 (owner: Jbond)
[12:47:08] SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (Zabe)
[12:47:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:51:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:07] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[13:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[14:00:15] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:14:45] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:31:27] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[14:36:31] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:55:53] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:07:43] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (jbond) I noticed some db errors in the logs, after manually checking the db config was correct, I have restarted otrs-daemon.service and things look healthy...
[15:09:40] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (GeneralNotability) Confirmed, just got a wave of emails.
[15:10:49] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Urbanecm) @jbond Thanks for the quick fix. I confirm emails started flowing back to my mailbox again :).
[15:12:47] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Daniuu) I can also confirm that a new waive of emails just started. Thanks for looking into this, @jbond .
[15:14:28] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (jbond) Open→Resolved a:jbond awesome will close but please re-open if you see other issues
[15:34:11] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (akosiaris) >>! In T302139#7724207, @jbond wrote: > I do see the following error in the log but this looks like it can wait for someone more knowledgable about...
[15:44:14] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Dzahn) Thank you @jbond ! We did talk about the log entries on Fridays but as you say the DB config looked correct.
[15:48:42] SRE, Infrastructure-Foundations, Mail, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Dzahn) There was a short outage of OTRS on Feb 16th. The config change I deployed was not the original cause but to fix that. Will share more details next week.
[15:51:55] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:26:40] (PS7) Giuseppe Lavagetto: conftool: add request-actions / request-patterns [puppet] - https://gerrit.wikimedia.org/r/763486
[16:26:42] (PS5) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557
[16:53:13] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:16:21] (CR) Andrew Bogott: [C: +2] Openstack monitoring: move fullstack instance check to python3 [puppet] - https://gerrit.wikimedia.org/r/763921 (owner: Andrew Bogott)
[17:23:48] (CR) Giuseppe Lavagetto: "I do understand editing expressions by hand in etcd with basically no validation is exposing us to a lot of footguns; but my plan would be" [puppet] - https://gerrit.wikimedia.org/r/763486 (owner: Giuseppe Lavagetto)
[17:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[18:04:01] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:28:41] (CR) Andrew Bogott: [C: +2] Openstack Cinder: Include service monitoring [puppet] - https://gerrit.wikimedia.org/r/763920 (owner: Andrew Bogott)
[18:29:47] (CR) Andrew Bogott: [C: +2] OpenStack cinder: monitor snapshots in 'admin' project [puppet] - https://gerrit.wikimedia.org/r/763919 (owner: Andrew Bogott)
[18:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[19:04:16] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:12:16] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:22:22] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 868969 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[19:24:32] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:25:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:17] (PS1) Andrew Bogott: wmcs-cinder-backup-manager.py: offset maps full upgrade [puppet] - https://gerrit.wikimedia.org/r/763953 (https://phabricator.wikimedia.org/T294429)
[19:33:43] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager.py: offset maps full upgrade [puppet] - https://gerrit.wikimedia.org/r/763953 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[20:21:06] (PS1) Andrew Bogott: nfs add_server: disable nfs mounts for new nfs servers [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/763955
[20:43:46] (CR) Andrew Bogott: [C: +2] nfs add_server: disable nfs mounts for new nfs servers [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/763955 (owner: Andrew Bogott)
[21:15:44] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:25:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:26:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:28:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:30:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:32:28] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[23:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[23:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org