[00:05:26] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:18:06] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:21:26] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:22] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:26] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:02:46] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:10] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:01:48] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:58] (PS1) Andrew Bogott: backy2: correct pool name for glance image backups [puppet] - https://gerrit.wikimedia.org/r/829294 (https://phabricator.wikimedia.org/T316980)
[02:18:51] (CR) Andrew Bogott: [C: +2] backy2: correct pool name for glance image backups [puppet] - https://gerrit.wikimedia.org/r/829294 (https://phabricator.wikimedia.org/T316980) (owner: Andrew Bogott)
[02:34:30] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:53:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:57:20] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:17:36] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:55:34] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:02:00] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[06:04:24] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[06:09:43] (PS1) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[06:12:46] (PS2) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[06:29:44] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:52:18] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:56:52] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[06:59:18] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220904T0700)
[07:24:16] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:32] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:38:02] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:46:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:33:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[08:33:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33760 and previous config saved to /var/cache/conftool/dbconfig/20220904-083341-ladsgroup.json
[08:33:47] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:46:32] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:05:36] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:25:03] (PS1) Majavah: P:toolforge::grid: remove legacy host key stuff [puppet] - https://gerrit.wikimedia.org/r/829305
[10:30:29] (PS3) Majavah: dynamicproxy: improve /zones API [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463)
[10:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T314041)', diff saved to https://phabricator.wikimedia.org/P33761 and previous config saved to /var/cache/conftool/dbconfig/20220904-103405-ladsgroup.json
[10:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[10:34:09] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:34:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[10:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33762 and previous config saved to /var/cache/conftool/dbconfig/20220904-103427-ladsgroup.json
[10:35:28] (CR) Majavah: dynamicproxy: improve /zones API (2 comments) [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah)
[11:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:50:09] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Aklapper) @Kelson: Could you please answer the last comment? Thanks in advance!
[12:18:18] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:48:24] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:48:58] !log pkill remaining processes of user effeietsanders on stat1008 to unblock puppet - T314846
[12:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:02] T314846: Check home/HDFS leftovers of effeietsanders - https://phabricator.wikimedia.org/T314846
[12:50:45] !log reset-fail ifup@ens13.service on netflow4002
[12:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:53] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:51:24] !log reset-fail ifup@ens13.service on idp2002
[12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:32] RECOVERY - Check systemd state on netflow4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:32] RECOVERY - Check systemd state on idp2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:28] (PS3) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[13:30:16] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:32:44] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:47:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[13:50:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[14:16:20] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (kaldari) @MarkTraceur @Tgr Could someone with shell access at least look in some of the relevant...
[14:55:28] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @nskaggs @Aklapper Not sure to fully understand the question. The answer has been posted earli...
[15:04:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Marostegui) Can this be resumed?
[15:05:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33763 and previous config saved to /var/cache/conftool/dbconfig/20220904-150508-ladsgroup.json
[15:05:11] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:12:30] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:20:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33764 and previous config saved to /var/cache/conftool/dbconfig/20220904-152015-ladsgroup.json
[15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33765 and previous config saved to /var/cache/conftool/dbconfig/20220904-153521-ladsgroup.json
[15:43:32] PROBLEM - Host elastic1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:32] PROBLEM - Host elastic1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:45:24] PROBLEM - Host ml-cache1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:36] PROBLEM - Host an-presto1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33766 and previous config saved to /var/cache/conftool/dbconfig/20220904-155027-ladsgroup.json
[15:50:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[15:50:31] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:50:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[15:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33767 and previous config saved to /var/cache/conftool/dbconfig/20220904-155059-ladsgroup.json
[16:07:19] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:15:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:17:10] PROBLEM - OpenSearch health check for shards on 9200 on logstash2027 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:20:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:24:08] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:26:42] PROBLEM - MD RAID on logstash2027 is CRITICAL: CRITICAL: State: degraded, Active: 22, Working: 22, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:26:43] ACKNOWLEDGEMENT - MD RAID on logstash2027 is CRITICAL: CRITICAL: State: degraded, Active: 22, Working: 22, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T316996 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:26:47] SRE, ops-codfw: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (ops-monitoring-bot)
[16:27:19] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:31:10] (CR) Krinkle: [C: +1] mwscript: Support --force-version flag (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/771001 (https://phabricator.wikimedia.org/T303878) (owner: Ahmon Dancy)
[16:46:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:54:54] (PS4) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[16:56:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:58:58] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:22:16] (PS5) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[17:50:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[17:53:08] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[18:16:50] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:25:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:26:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.612 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:47:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:47:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:47:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:47:50] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:58:50] PROBLEM - Outgoing network saturation on labstore1006 is CRITICAL: 1.129e+09 ge 1.062e+09 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:02:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:02:45] (Primary outbound port utilisation over 80% #page) firing: (2) Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:04:27] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1006 is CRITICAL: 1.171e+09 ge 1.062e+09 Andrew Bogott this seems to be intermittent https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:06:13] RECOVERY - Outgoing network saturation on labstore1006 is OK: (C)1.062e+09 ge (W)9.375e+08 ge 1.15e+08 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[19:07:45] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:07:45] (Primary outbound port utilisation over 80% #page) firing: (2) Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:10:54] (PS1) Ladsgroup: dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312
[19:12:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:12:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:12:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:12:50] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:15:45] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (Tgr) These aren't directories, they are Swift containers. Our documentation on how to do common f...
[19:15:51] SRE, Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (CDanis)
[19:16:15] (PS2) Ladsgroup: dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001)
[19:16:38] (CR) CDanis: [C: +1] dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001) (owner: Ladsgroup)
[19:17:09] SRE, Data-Services, Patch-For-Review: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Andrew)
[19:17:36] SRE, Data-Services, Patch-For-Review: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Ladsgroup) {P33768}
[19:19:54] (CR) Ladsgroup: [C: +2] dumps: Deny Apple IPs [puppet] - https://gerrit.wikimedia.org/r/829312 (https://phabricator.wikimedia.org/T317001) (owner: Ladsgroup)
[19:22:35] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (Tgr) `mwstore://local-swift-eqiad/local-public/archive/0/06` at least returns an iterator, but it...
[19:23:34] SRE, Infrastructure-Foundations, vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (Krinkle) @Dzahn LGTM. Thanks for filing those. I would suggest for the hostnames to go with `arclamp#001`.
[19:26:46] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (Yann) I again get the same error while deleting files, i.e. https://commons.wi...
[19:28:28] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:39:58] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (jcrespo) @Yann This issue described on this ticket refers to a very particular...
[20:04:44] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:42] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:26:58] SRE, SRE-swift-storage, Commons, Wikimedia-Incident, Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (Yann) OK, thanks for the information. Now this new error seem to be difficult...
[20:47:06] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:06] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:09:31] SRE, Acme-chief, Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (Aklapper) @Vgutierrez: Could you please answer the last comment? Thanks in advance!
[21:34:11] (PS6) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[21:57:26] (CR) Andrew Bogott: [C: +1] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: David Caro)
[22:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33769 and previous config saved to /var/cache/conftool/dbconfig/20220904-220457-ladsgroup.json
[22:05:01] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33770 and previous config saved to /var/cache/conftool/dbconfig/20220904-222004-ladsgroup.json
[22:30:56] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[22:33:24] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[22:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33771 and previous config saved to /var/cache/conftool/dbconfig/20220904-223510-ladsgroup.json
[22:45:42] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[22:48:08] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[22:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312863)', diff saved to https://phabricator.wikimedia.org/P33772 and previous config saved to /var/cache/conftool/dbconfig/20220904-225016-ladsgroup.json
[22:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:50:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:50:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:50:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33773 and previous config saved to /var/cache/conftool/dbconfig/20220904-225044-ladsgroup.json
[23:10:00] (PS1) Zabe: varnish: Fix identation [puppet] - https://gerrit.wikimedia.org/r/829319
[23:17:49] (PS1) Andrew Bogott: prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - https://gerrit.wikimedia.org/r/829320
[23:17:51] (PS1) Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321
[23:18:46] (CR) CI reject: [V: -1] Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[23:20:04] (PS2) Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321
[23:23:24] (CR) Andrew Bogott: "Just in case taavi has not already written a better version of this someplace..." [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[23:27:34] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method
[23:30:00] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[23:34:14] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:46:56] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33774 and previous config saved to /var/cache/conftool/dbconfig/20220904-235100-ladsgroup.json
[23:51:06] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863