[00:05:26] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:18:06] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:21:26] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:22] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:26] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:02:46] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:10] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:01:48] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:58] (PS1) Andrew Bogott: backy2: correct pool name for glance image backups [puppet] - https://gerrit.wikimedia.org/r/829294 (https://phabricator.wikimedia.org/T316980)
[02:18:51] (CR) Andrew Bogott: [C: +2] backy2: correct pool name for glance image backups [puppet] - https://gerrit.wikimedia.org/r/829294 (https://phabricator.wikimedia.org/T316980) (owner: Andrew Bogott)
[02:34:30] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:53:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:57:20] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:17:36] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:55:34] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:02:00] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[06:04:24] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:09:43] (PS1) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[06:12:46] (PS2) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[06:29:44] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:52:18] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:56:52] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[06:59:18] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220904T0700)
[07:24:16] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:32] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:38:02] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:46:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:33:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:33:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33760 and previous config saved to /var/cache/conftool/dbconfig/20220904-083341-ladsgroup.json
[08:33:47] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:46:32] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:05:36] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:25:03] (PS1) Majavah: P:toolforge::grid: remove legacy host key stuff [puppet] - https://gerrit.wikimedia.org/r/829305
[10:30:29] (PS3) Majavah: dynamicproxy: improve /zones API [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463)
[10:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T314041)', diff saved to https://phabricator.wikimedia.org/P33761 and previous config saved to /var/cache/conftool/dbconfig/20220904-103405-ladsgroup.json
[10:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[10:34:09] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:34:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[10:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33762 and previous config saved to /var/cache/conftool/dbconfig/20220904-103427-ladsgroup.json
[10:35:28] (CR) Majavah: dynamicproxy: improve /zones API (2 comments) [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah)
[11:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:50:09] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Aklapper) @Kelson: Could you please answer the last comment? Thanks in advance!
[12:18:18] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:48:24] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:48:58] !log pkill remaining processes of user effeietsanders on stat1008 to unblock puppet - T314846
[12:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:02] T314846: Check home/HDFS leftovers of effeietsanders - https://phabricator.wikimedia.org/T314846
[12:50:45] !log reset-fail ifup@ens13.service on netflow4002
[12:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:53] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:51:24] !log reset-fail ifup@ens13.service on idp2002
[12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:32] RECOVERY - Check systemd state on netflow4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:32] RECOVERY - Check systemd state on idp2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:28] (PS3) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[13:30:16] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:32:44] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:47:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:50:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[14:16:20] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (kaldari) @MarkTraceur @Tgr Could someone with shell access at least look in some of the relevant...
[14:55:28] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @nskaggs @Aklapper Not sure to fully understand the question. The answer has been posted earli...
[15:04:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Marostegui) Can this be resumed?
[15:05:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33763 and previous config saved to /var/cache/conftool/dbconfig/20220904-150508-ladsgroup.json
[15:05:11] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:12:30] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:20:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33764 and previous config saved to /var/cache/conftool/dbconfig/20220904-152015-ladsgroup.json
[15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33765 and previous config saved to /var/cache/conftool/dbconfig/20220904-153521-ladsgroup.json
[15:43:32] PROBLEM - Host elastic1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:32] PROBLEM - Host elastic1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:45:24] PROBLEM - Host ml-cache1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:36] PROBLEM - Host an-presto1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33766 and previous config saved to /var/cache/conftool/dbconfig/20220904-155027-ladsgroup.json
[15:50:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[15:50:31] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:50:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[15:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33767 and previous config saved to /var/cache/conftool/dbconfig/20220904-155059-ladsgroup.json
[16:07:19] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:15:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:17:10] PROBLEM - OpenSearch health check for shards on 9200 on logstash2027 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:20:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:24:08] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:26:42] PROBLEM - MD RAID on logstash2027 is CRITICAL: CRITICAL: State: degraded, Active: 22, Working: 22, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:26:43] ACKNOWLEDGEMENT - MD RAID on logstash2027 is CRITICAL: CRITICAL: State: degraded, Active: 22, Working: 22, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T316996 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:26:47] SRE, ops-codfw: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (ops-monitoring-bot)
[16:27:19] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:31:10] (CR) Krinkle: [C: +1] mwscript: Support --force-version flag (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: Ahmon Dancy)
[16:46:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:54:54] (PS4) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[16:56:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:58:58] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:22:16] (PS5) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[17:50:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:53:08] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[18:16:50] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:25:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:26:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.612 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:47:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:47:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:47:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:47:50] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:58:50] PROBLEM - Outgoing network saturation on labstore1006 is CRITICAL: 1.129e+09 ge 1.062e+09 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:02:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:02:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:04:27] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1006 is CRITICAL: 1.171e+09 ge 1.062e+09 Andrew Bogott this seems to be intermittent https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:06:13] RECOVERY - Outgoing network saturation on labstore1006 is OK: (C)1.062e+09 ge (W)9.375e+08 ge 1.15e+08 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:07:45] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:07:45] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:10:54] (PS1) Ladsgroup: dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312
[19:12:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:12:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:12:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:12:50] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:15:45] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (Tgr) These aren't directories, they are Swift containers. Our documentation on how to do common f...
[19:15:51] SRE, Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (CDanis)
[19:16:15] (PS2) Ladsgroup: dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001)
[19:16:38] (CR) CDanis: [C: +1] dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001) (owner: Ladsgroup)
[19:17:09] SRE, Data-Services, Patch-For-Review: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Andrew)
[19:17:36] SRE, Data-Services, Patch-For-Review: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Ladsgroup) {P33768}
[19:19:54] (CR) Ladsgroup: [C: +2] dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001) (owner: Ladsgroup)
[19:22:35] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (Tgr) `mwstore://local-swift-eqiad/local-public/archive/0/06` at least returns an iterator, but it...
[19:23:34] SRE, Infrastructure-Foundations, vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (Krinkle) @Dzahn LGTM. Thanks for filing those. I would suggest for the hostnames to go with `arclamp#001`.
[19:26:46] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (Yann) I again get the same error while deleting files, i.e. https://commons.wi...
[19:28:28] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:39:58] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (jcrespo) @Yann This issue described on this ticket refers to a very particular...
[20:04:44] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:42] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:26:58] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (Yann) OK, thanks for the information. Now this new error seem to be difficult...
[20:47:06] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:06] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:09:31] SRE, Acme-chief, Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (Aklapper) @Vgutierrez: Could you please answer the last comment? Thanks in advance!
[21:34:11] (PS6) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[21:57:26] (CR) Andrew Bogott: [C: +1] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: David Caro)
[22:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33769 and previous config saved to /var/cache/conftool/dbconfig/20220904-220457-ladsgroup.json
[22:05:01] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33770 and previous config saved to /var/cache/conftool/dbconfig/20220904-222004-ladsgroup.json
[22:30:56] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[22:33:24] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[22:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33771 and previous config saved to /var/cache/conftool/dbconfig/20220904-223510-ladsgroup.json
[22:45:42] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[22:48:08] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[22:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33772 and previous config saved to /var/cache/conftool/dbconfig/20220904-225016-ladsgroup.json
[22:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:50:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:50:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:50:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33773 and previous config saved to /var/cache/conftool/dbconfig/20220904-225044-ladsgroup.json
[23:10:00] (PS1) Zabe: varnish: Fix identation [puppet] - https://gerrit.wikimedia.org/r/829319
[23:17:49] (PS1) Andrew Bogott: prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - https://gerrit.wikimedia.org/r/829320
[23:17:51] (PS1) Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321
[23:18:46] (CR) CI reject: [V: -1] Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[23:20:04] (PS2) Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321
[23:23:24] (CR) Andrew Bogott: "Just in case taavi has not already written a better version of this someplace..." [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[23:27:34] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[23:30:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[23:34:14] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:46:56] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33774 and previous config saved to /var/cache/conftool/dbconfig/20220904-235100-ladsgroup.json
[23:51:06] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863