[00:04:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:20:39] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:21:31] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:29] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:24:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:30:47] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:32:33] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:32:33] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:39:34] <wikibugs>	 (PS1) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[00:40:31] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm)
[00:42:33] <wikibugs>	 (PS2) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[00:43:13] <wikibugs>	 (PS3) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[00:44:08] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm)
[00:44:33] <wikibugs>	 (PS4) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[00:45:41] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35823/console" [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm)
[00:51:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[00:55:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:04:19] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:31] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:08:47] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:15:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:25] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:22:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:24:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:30:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:35:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:40:27] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:46:15] <wikibugs>	 SRE, Traffic: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (Bugreporter)
[01:51:47] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:59:29] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:05] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[02:15:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[02:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29629 and previous config saved to /var/cache/conftool/dbconfig/20220613-021511-ladsgroup.json
[02:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:14] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:15:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:21:35] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:25:41] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:28:35] <wikibugs>	 (PS1) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[02:29:46] <wikibugs>	 (CR) CI reject: [V: -1] WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[02:30:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:31:40] <wikibugs>	 (PS2) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[02:36:14] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35824/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[02:37:01] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:38:46] <wikibugs>	 (PS3) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[02:40:06] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35825/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[02:42:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:47:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:41] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:54:12] <wikibugs>	 (PS4) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[02:56:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:58:20] <wikibugs>	 (PS5) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[02:59:45] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35827/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[03:01:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:26] <wikibugs>	 (PS6) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[03:07:13] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35828/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[03:08:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:12:07] <wikibugs>	 (PS7) Legoktm: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800
[03:14:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:16:52] <wikibugs>	 (PS5) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[03:17:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:17:58] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm)
[03:19:10] <wikibugs>	 (PS6) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314)
[03:22:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:23:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:23:45] <wikibugs>	 (CR) Legoktm: "The follow-up patch demonstrates the usefulness of this refactor." [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[03:28:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:29:25] <wikibugs>	 SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 4 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (Winston_Sung)
[03:32:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:19] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:40:49] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:41:03] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:45:49] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:50:48] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:50:54] <legoktm>	 down?
[03:51:06] <AntiComposite>	 upstream connect error or disconnect/reset before headers. reset reason: overflow
[03:51:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:52:18] <jinxer-wm>	 (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:52:18] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:52:35] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:52:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:52:41] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:53:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:14] <legoktm>	 took too long to log into klaxon, pages fired on their own
[03:53:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:47] <icinga-wm>	 PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:53:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:54:09] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:54:11] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:54:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:55:11] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6575 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:55:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:25] <legoktm>	 [being investigated]
[03:55:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:55:59] <icinga-wm>	 RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:01] <icinga-wm>	 RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:07] <icinga-wm>	 RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:07] <icinga-wm>	 RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:56:18] <AntiComposite>	 looks up from here now, a few people in discord reporting the same
[03:56:31] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:57:18] <jinxer-wm>	 (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:57:19] <jinxer-wm>	 (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:57:21] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:57:31] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:57:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:57:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:58:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:58:45] <wikibugs>	 SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 3 others: Show SVGs in page view language for language variants if available - https://phabricator.wikimedia.org/T310453 (Winston_Sung)
[03:58:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:59:17] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:01:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:02:33] <jinxer-wm>	 (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:06:50] <rzl>	 everything's back up as far as we can see, continuing to stabilize some things but speak up if you're still having trouble accessing anything <3
[04:14:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:16:36] <wikibugs>	 SRE, Traffic, Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (Liz) This happened again in the past 15 minutes and lasted about 4 or 5 minutes.
[04:18:59] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:19:03] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:20:39] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:21:40] <wikibugs>	 SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 4 others: Show SVGs in page view language for language variants if available - https://phabricator.wikimedia.org/T310453 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by Winston Sung using pat...
[04:22:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:23:55] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 25.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:24:37] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 47.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:24:53] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 54.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:25:47] <mutante>	 !log thumbor2006 - host down - attempting powercycle via DRAC console
[04:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:55] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:27:11] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 91.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:28:35] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.65 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:29:25] <mutante>	 !log thumbor2004 - attempted powercycle via DRAC console
[04:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:45] <wikibugs>	 (PS3) KartikMistry: Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256
[04:32:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:32:27] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=thumbor2004.codfw.wmnet
[04:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:34:48] <wikibugs>	 SRE: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn)
[04:35:08] <wikibugs>	 SRE: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) 04:32 <+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=thumbor2004.codfw.wmnet
[04:35:26] <wikibugs>	 SRE: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn)
[04:35:37] <wikibugs>	 (CR) KartikMistry: [C: +2] Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256 (owner: KartikMistry)
[04:35:45] <icinga-wm>	 ACKNOWLEDGEMENT - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T310455
[04:37:31] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn)
[04:40:00] <wikibugs>	 (Merged) jenkins-bot: Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256 (owner: KartikMistry)
[04:44:56] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=thumbor2004.codfw.wmnet
[04:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:06] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn)
[04:46:55] <wikibugs>	 (PS1) Legoktm: mediawiki: Disable useless mostlinkedcategories update job [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456)
[04:47:30] <wikibugs>	 (PS1) Legoktm: mediawiki: Remove absented mostlinkedcategories job [puppet] - https://gerrit.wikimedia.org/r/804804
[04:49:20] <wikibugs>	 (PS1) Legoktm: Remove misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456)
[04:50:24] * kart_ updating cxserver..
[04:50:27] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:50:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:41] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[04:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:51:48] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) /admin1-> racadm serveraction powercycle  Server power operation successful  ---  but nothing happens
[04:52:13] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35830/console" [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm)
[04:52:55] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:54:06] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[04:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:41] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:54:55] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[04:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:54] <wikibugs>	 (PS1) Samwilson: Enable Realtime Preview on cawiki, viwiki, and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/804806 (https://phabricator.wikimedia.org/T303961)
[04:56:34] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[04:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:17] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[04:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:01] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:59:19] <kart_>	 !log Updated cxserver to 2022-06-08-124326-production + nodejs > node command update (T306995, T309169)
[04:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:23] <stashbot>	 T309169: Set Google as the default translation service when translating to Spanish - https://phabricator.wikimedia.org/T309169
[04:59:23] <stashbot>	 T306995: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995
[05:02:08] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (KartikMistry) Upgrade note: node14 has removed symlink of nodejs -> node command.
[05:05:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:10] <wikibugs>	 (PS1) Legoktm: mediawiki: Switch sharded_periodic_job to use foreachwikiindblist [puppet] - https://gerrit.wikimedia.org/r/804807
[05:07:30] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35831/console" [puppet] - https://gerrit.wikimedia.org/r/804807 (owner: Legoktm)
[05:12:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:13:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:13:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[05:14:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[05:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T310011)', diff saved to https://phabricator.wikimedia.org/P29633 and previous config saved to /var/cache/conftool/dbconfig/20220613-051407-marostegui.json
[05:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:12] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:16:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T310011)', diff saved to https://phabricator.wikimedia.org/P29634 and previous config saved to /var/cache/conftool/dbconfig/20220613-051613-marostegui.json
[05:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29635 and previous config saved to /var/cache/conftool/dbconfig/20220613-053118-marostegui.json
[05:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:17] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:46:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29636 and previous config saved to /var/cache/conftool/dbconfig/20220613-054623-marostegui.json
[05:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29637 and previous config saved to /var/cache/conftool/dbconfig/20220613-055557-root.json
[05:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29638 and previous config saved to /var/cache/conftool/dbconfig/20220613-061101-root.json
[06:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:35] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:21:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:23:25] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:26:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29639 and previous config saved to /var/cache/conftool/dbconfig/20220613-062605-root.json
[06:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:43] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (ayounsi)
[06:29:15] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:36:21] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29640 and previous config saved to /var/cache/conftool/dbconfig/20220613-064109-root.json
[06:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:21] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T0700).
[07:00:04] <jouncebot>	 TheresNoTime: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] <wikibugs>	 (CR) Ayounsi: "I don't know enough the venv internals to suggest a better approach." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond)
[07:06:59] <wikibugs>	 (CR) Ayounsi: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond)
[07:11:00] <wikibugs>	 (CR) Slyngshede: [C: +2] C:query_service::deploy::autodeploy remove used autodeploy. [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:11:59] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:18:09] <wikibugs>	 (PS1) Muehlenhoff: Record removed Kerberos principal [puppet] - https://gerrit.wikimedia.org/r/805078
[07:23:19] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Record removed Kerberos principal [puppet] - https://gerrit.wikimedia.org/r/805078 (owner: Muehlenhoff)
[07:24:46] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for dstrine [puppet] - https://gerrit.wikimedia.org/r/805079
[07:28:39] <wikibugs>	 (PS5) Slyngshede: Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673)
[07:30:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for dstrine [puppet] - https://gerrit.wikimedia.org/r/805079 (owner: Muehlenhoff)
[07:31:05] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:31:47] <wikibugs>	 (CR) CI reject: [V: -1] Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:33:38] <wikibugs>	 (PS6) Slyngshede: Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673)
[07:38:03] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35832/console" [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:41:45] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:54:58] <moritzm>	 !log failover ganeti master in esams to ganeti3003 T308238
[07:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:03] <stashbot>	 T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[07:55:29] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[08:00:19] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:03:01] <icinga-wm>	 PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:27] <wikibugs>	 (CR) Volans: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond)
[08:06:39] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:39] <icinga-wm>	 PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:06] <volans>	 XioNoX: ^^^ mr1-drmrs
[08:07:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:13:01] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:14:04] <XioNoX>	 volans: looking
[08:15:44] <XioNoX>	 volans: looks like it died, and of course the console server is only reachable through mgmt :)
[08:15:53] <volans>	 :/
[08:16:16] <XioNoX>	 thanks to parent/child in netbox, only the relevant things alerted
[08:16:22] <XioNoX>	 er, in icinga I mean
[08:16:27] <XioNoX>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=2&sortoption=3&serviceprops=270336&hostprops=270336
[08:16:42] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (SLyngshede-WMF) p:Triage→Medium
[08:20:39] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:22:00] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:22:37] <XioNoX>	 ohhh, did we lose a power feed?
[08:22:45] <volans>	 I was about to say the same...
[08:23:20] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:23:23] <XioNoX>	 I guess it's easier to fix than a failed router
[08:23:24] <volans>	 yep
[08:23:35] <XioNoX>	 asw1-b13-drmrs> show chassis environment 
[08:23:35] <XioNoX>	 Class Item                           Status     Measurement
[08:23:35] <XioNoX>	 Power FPC 0 Power Supply 0           OK         35 degrees C / 95 degrees F
[08:23:35] <XioNoX>	       FPC 0 Power Supply 1           Present   
[08:23:37] <volans>	 from icinga they are all in soft critical
[08:24:09] <XioNoX>	 was there a planned maintenance?
[08:24:36] <volans>	 I can't see one in the calendar
[08:24:42] <XioNoX>	 thanks
[08:24:44] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:25:50] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:25:50] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:27:40] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:28:06] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:28:14] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:29:02] <XioNoX>	 Please be advised that all equipment connected to only one feed could lose power.
[08:29:02] <XioNoX>	 We advise you to check that your equipment is connected to both feeds provided and/or to automatic source inverters in the case it is only connected to a single feed. 
[08:29:29] <XioNoX>	 the mrs2 notifications don't go to maint-announce
[08:29:30] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:29:40] <volans>	 XioNoX: and where do they go?
[08:29:53] <XioNoX>	 volans: noreply-notifications@interxion.com :)
[08:29:56] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:30:02] <volans>	 :/
[08:30:16] <XioNoX>	 at least me, and probably some of DCops
[08:30:59] <XioNoX>	 btw:
[08:31:05] <XioNoX>	 Time Start: 13 June 2022  09:00  Local time
[08:31:05] <XioNoX>	 Time End: 13 June 2022   18:00 Local time
[08:31:10] <XioNoX>	 so all day
[08:31:27] <volans>	 hopefully less than that
[08:32:09] <wikibugs>	 (CR) Jaime Nuche: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) (owner: Jaime Nuche)
[08:33:06] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:34:31] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) Thanks for the update; I think the ILO needs our local configuration re-applying to it? If so, are you OK to do that, please?
[08:34:32] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:34:34] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:34:42] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:35:48] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:35:58] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[08:36:22] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:39:30] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:40:38] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:41:18] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:42:24] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:43:00] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:46:52] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:46:52] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:48:18] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:51:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:51:28] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:55:22] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:58:30] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:03:52] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:05:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:07:11] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:07:20] <moritzm>	 !log installing ntfs-3g security updates
[09:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:28] <wikibugs>	 (PS12) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - https://gerrit.wikimedia.org/r/802040
[09:11:20] <wikibugs>	 ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (ayounsi) p:Triage→High
[09:11:47] <wikibugs>	 ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (ayounsi)
[09:12:16] <moritzm>	 !log drain ganeti3001 for firmware update/reimage T308238
[09:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:19] <stashbot>	 T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[09:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b13-drmrs-infeed-load-tower-B-single-phase on ps1-b13-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b13-drmrs-infeed-load-tower-A-single-phase on ps1-b13-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b12-drmrs-infeed-load-tower-B-single-phase on ps1-b12-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - ps1-b12-drmrs-infeed-load-tower-A-single-phase on ps1-b12-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00.
[09:14:26] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:26] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:26] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:26] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:26] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:27] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:27] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:28] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:28] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:29] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:29] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:30] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:30] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:31] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:31] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:32] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:32] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:33] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:33] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:34] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:34] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:35] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:35] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:36] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:14:36] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:16:21] <wikibugs>	 (PS5) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251
[09:19:56] <wikibugs>	 (CR) CI reject: [V: -1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[09:20:15] <wikibugs>	 (CR) Volans: "Do we need a new cookbook? can't we just extend the sre.hosts.dhcp one for this use case?" [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[09:21:36] <wikibugs>	 (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:25:50] <wikibugs>	 (PS6) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251
[09:29:58] <wikibugs>	 (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[09:43:20] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:45:34] <wikibugs>	 (PS7) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251
[09:46:23] <wikibugs>	 (CR) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[09:48:55] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon) p:Triage→High
[09:50:12] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon)
[09:50:14] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[09:50:38] <wikibugs>	 (PS1) Slyngshede: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385)
[09:51:11] <wikibugs>	 (PS9) David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - https://gerrit.wikimedia.org/r/802074
[09:51:13] <wikibugs>	 (CR) David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro)
[09:51:21] <wikibugs>	 (PS3) David Caro: alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - https://gerrit.wikimedia.org/r/802489
[09:51:45] <wikibugs>	 SRE, Data-Persistence-Backup, media-backups, Goal, Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (jcrespo) > Do whatever is the least effort from your end that still preserves something  Thank you a lot, that...
[09:52:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:05] <wikibugs>	 (CR) CI reject: [V: -1] LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede)
[09:52:59] <wikibugs>	 (PS3) Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214)
[09:53:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:53:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:15] <wikibugs>	 (PS2) Slyngshede: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385)
[09:54:23] <wikibugs>	 (CR) Hnowlan: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans)
[09:54:31] <wikibugs>	 (CR) David Caro: [C: +2] alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - https://gerrit.wikimedia.org/r/802489 (owner: David Caro)
[09:54:51] <wikibugs>	 (CR) jenkins-bot: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede)
[09:56:23] <wikibugs>	 (PS3) Slyngshede: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385)
[09:58:03] <wikibugs>	 (PS2) Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572
[09:58:10] <wikibugs>	 (CR) Jbond: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond)
[10:01:27] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede)
[10:10:58] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:13:33] <moritzm>	 !log installing 5.10.120 kernel updates on bullseye hosts
[10:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[10:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[10:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T310011)', diff saved to https://phabricator.wikimedia.org/P29641 and previous config saved to /var/cache/conftool/dbconfig/20220613-101537-marostegui.json
[10:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:41] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:16:58] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:37:43] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui)
[10:37:50] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) p:Triage→High
[10:37:55] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:37:58] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:17] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:20] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:38:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:12] <wikibugs>	 (CR) Slyngshede: [C: +2] LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede)
[10:41:37] <wikibugs>	 (PS1) Marostegui: dbproxy2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805095 (https://phabricator.wikimedia.org/T310484)
[10:43:56] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805095 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui)
[10:44:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T310011)', diff saved to https://phabricator.wikimedia.org/P29642 and previous config saved to /var/cache/conftool/dbconfig/20220613-104449-marostegui.json
[10:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:54] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:45:28] <wikibugs>	 (PS1) Marostegui: x2 databases: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805096 (https://phabricator.wikimedia.org/T310485)
[10:47:02] <wikibugs>	 (CR) Marostegui: [C: +2] x2 databases: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805096 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui)
[10:50:20] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:23] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:36] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:50:42] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:11] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:51:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:22] <wikibugs>	 SRE, Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (MoritzMuehlenhoff)
[10:51:56] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:05] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:52:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:26] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:30] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:52:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:52] <wikibugs>	 (CR) Ayounsi: [C: +1] scap: update venv to use the system ca bundle [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond)
[10:56:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dsharpe out of all services on: 609 hosts
[10:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dsharpe out of all services on: 609 hosts
[10:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29643 and previous config saved to /var/cache/conftool/dbconfig/20220613-105954-marostegui.json
[10:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:15] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet
[11:00:17] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dns6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dsharpe out of all services on: 1219 hosts
[11:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:50] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dns6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:01:57] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:02:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dsharpe out of all services on: 1219 hosts
[11:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:48] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:03:59] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet
[11:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:16] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet
[11:04:16] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:20] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:04:51] <wikibugs>	 (PS1) MMandere: smokeping: Temp mute smokeping for host lvs6001 [puppet] - https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470)
[11:05:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (SLyngshede-WMF) a:elukey
[11:05:36] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:05:50] <wikibugs>	 (CR) Ayounsi: [C: +1] smokeping: Temp mute smokeping for host lvs6001 [puppet] - https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470) (owner: MMandere)
[11:06:00] <icinga-wm>	 RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.46 ms
[11:06:02] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:06:25] <wikibugs>	 (CR) MMandere: [C: +2] smokeping: Temp mute smokeping for host lvs6001 [puppet] - https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470) (owner: MMandere)
[11:07:14] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2002.codfw.wmnet
[11:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:59] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet
[11:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:17] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[11:08:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:22] <icinga-wm>	 RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms
[11:08:22] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.62 ms
[11:08:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:09:06] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:10:27] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[11:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:31] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:10:36] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6013 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:10:40] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:10:48] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:10:56] <wikibugs>	 SRE, DBA, Patch-For-Review, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui)
[11:11:30] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet
[11:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:52] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:12:24] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:12:30] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2002.codfw.wmnet
[11:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:55] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet
[11:12:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:52] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet
[11:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29644 and previous config saved to /var/cache/conftool/dbconfig/20220613-111459-marostegui.json
[11:15:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:12] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet
[11:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:34] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox2002.codfw.wmnet
[11:15:34] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29645 and previous config saved to /var/cache/conftool/dbconfig/20220613-111621-root.json
[11:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:17:28] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:18:14] <marostegui>	 !log Reboot db1131 for kernel upgrade T310485
[11:18:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:35] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=netbox,name=codfw
[11:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:46] <marostegui>	 !log Reboot x2 hosts for kernel upgrade T310485
[11:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:04] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=eqiad
[11:19:04] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:08] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=eqiad
[11:19:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:22] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2002.codfw.wmnet
[11:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:41] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1002.eqiad.wmnet
[11:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:09] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:22:09] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6016 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:23:05] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:23:09] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6015 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:23:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29646 and previous config saved to /var/cache/conftool/dbconfig/20220613-112356-root.json
[11:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:26] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host netbox1002.eqiad.wmnet
[11:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:38] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=netbox
[11:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:41] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-netbox.state on authdns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-netbox.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[11:24:49] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=codfw
[11:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:53] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-netbox.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-netbox.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[11:25:30] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org
[11:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:43] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org
[11:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:23] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:27:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:27:39] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org
[11:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:51] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org
[11:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:55] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org
[11:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:07] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:28:29] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet
[11:28:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:23] <wikibugs>	 (PS1) Jbond: ido: failover to preform reboot [dns] - https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483)
[11:29:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[11:29:54] <wikibugs>	 (PS2) Jbond: idp: failover to preform reboot [dns] - https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483)
[11:30:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T310011)', diff saved to https://phabricator.wikimedia.org/P29647 and previous config saved to /var/cache/conftool/dbconfig/20220613-113004-marostegui.json
[11:30:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:30:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:10] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:04] <wikibugs>	 (PS1) David Caro: Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789)
[11:31:13] <Mitar>	 jbond: are you here? could we merge https://phabricator.wikimedia.org/T301104?
[11:31:23] <jbond>	 Mitar: hi and yes one sec
[11:31:42] <jbond>	 sorry i missed you friday
[11:31:45] <wikibugs>	 (CR) Jbond: [C: +2] Add page metadata to Wikibase JSON dumps [puppet] - https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: Mitar)
[11:32:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:34:36] <jbond>	 Mitar: merged and deployed to all the snapshot machines
[11:35:28] <wikibugs>	 (CR) CI reject: [V: -1] Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) (owner: David Caro)
[11:35:31] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp2002.wikimedia.org
[11:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:41] <icinga-wm>	 PROBLEM - Check systemd state on idp2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:58] * jbond looking
[11:36:08] <jbond>	 .. at idp2002
[11:36:17] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox-dev2002.codfw.wmnet
[11:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:53] <Mitar>	 awesome, thanks!
[11:37:17] <jbond>	 np
[11:39:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29648 and previous config saved to /var/cache/conftool/dbconfig/20220613-113900-root.json
[11:39:03] <icinga-wm>	 RECOVERY - Check systemd state on idp2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:05] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/804710
[11:39:19] <wikibugs>	 (PS1) Marostegui: Revert "x2 databases: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/804711
[11:40:31] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/804710 (owner: Marostegui)
[11:42:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:44:04] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "x2 databases: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/804711 (owner: Marostegui)
[11:47:53] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) Active proxies: ` # for i in m1 m2 m3 m5; do host $i-master | grep alias ;done m1-master.eqiad.wmnet is an alias for dbproxy1012.eqiad.wmnet. m2-master.eqiad.wmnet is an alias for dbproxy10...
[11:52:19] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:52:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29649 and previous config saved to /var/cache/conftool/dbconfig/20220613-115238-marostegui.json
[11:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:41] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:54:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29650 and previous config saved to /var/cache/conftool/dbconfig/20220613-115404-root.json
[11:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:18] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui)
[11:54:24] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) (owner: Jbond)
[11:54:26] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:54:31] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui)
[11:56:46] <wikibugs>	 SRE: an-tool1005 - memcached Connection refused - https://phabricator.wikimedia.org/T309886 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF Memcache was restarted by @elukey on Mon 2022-06-06 06:30:42 UTC
[11:58:11] <wikibugs>	 (PS4) Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214)
[11:58:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:20] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m1 and m2 [dns] - https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484)
[12:00:57] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (SLyngshede-WMF) p:Triage→Low We just need to clarify if there's an approval process for requesting new mailing lists. I'll try to...
[12:02:39] <XioNoX>	 looks like drmrs recovered
[12:03:19] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:07:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti3001.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti3001.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29651 and previous config saved to /var/cache/conftool/dbconfig/20220613-120907-root.json
[12:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:02] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (SLyngshede-WMF) Open→In progress p:Low→High
[12:16:24] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:17:35] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH ganeti3001 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye.
[12:19:18] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (Ottomata) I think the bigtop15 .deb packages can/should just be copied to bullsye?  https://apt.wikimedia.org/wikimedia/pool/thirdparty/b...
[12:19:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29652 and previous config saved to /var/cache/conftool/dbconfig/20220613-121949-marostegui.json
[12:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:54] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:24:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29653 and previous config saved to /var/cache/conftool/dbconfig/20220613-122411-root.json
[12:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon) Open→Resolved a:Cmjohnson→MatthewVernon This turned out to be an incorrect config section - I've updated https://wikitech.wikim...
[12:25:16] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[12:25:46] <wikibugs>	 (CR) Ayounsi: [C: +1] devices: override default timeout for mgmt routers [homer/public] - https://gerrit.wikimedia.org/r/799381 (owner: Volans)
[12:26:26] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon)
[12:27:04] <wikibugs>	 (CR) Ayounsi: "I think this can be abandoned as we're not going with 2 SCAP repositories anymore." [puppet] - https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[12:29:19] <wikibugs>	 (PS1) Btullis: Failover hive to the standby server [dns] - https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526)
[12:29:45] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet
[12:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:59] <wikibugs>	 (CR) Jbond: [C: +2] idp: failover to preform reboot [dns] - https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) (owner: Jbond)
[12:30:56] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (MoritzMuehlenhoff) >>! In T310451#7998402, @Ottomata wrote: > I think the bigtop15 .deb packages can/should just be copied to bullsye?  I...
[12:31:06] <wikibugs>	 (Abandoned) Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[12:31:50] <wikibugs>	 (PS2) Btullis: Failover hive to the standby server [dns] - https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526)
[12:33:14] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (Ottomata) Ah, okay!
[12:33:52] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet
[12:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P29654 and previous config saved to /var/cache/conftool/dbconfig/20220613-123454-marostegui.json
[12:34:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:31] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471)
[12:38:52] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471)
[12:39:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29655 and previous config saved to /var/cache/conftool/dbconfig/20220613-123915-root.json
[12:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:28] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) (owner: Marostegui)
[12:42:56] <wikibugs>	 (CR) Btullis: [C: +2] Failover hive to the standby server [dns] - https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526) (owner: Btullis)
[12:46:17] <wikibugs>	 (PS3) Elukey: admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044)
[12:48:31] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (SLyngshede-WMF) In progress→Resolved a:SLyngshede-WMF Mailing list have been created, but please check that you have access vi...
[12:49:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch idp1001/idp2001 to role(insetup) [puppet] - https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[12:50:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P29657 and previous config saved to /var/cache/conftool/dbconfig/20220613-124959-marostegui.json
[12:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:58] <wikibugs>	 (CR) Ayounsi: "I haven't done a deep review of the python side, but the logic sgtm!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[12:51:01] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org
[12:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:05] <wikibugs>	 (PS1) Kosta Harlan: NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768)
[12:51:29] <wikibugs>	 (CR) Kosta Harlan: [C: +2] "Backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: Kosta Harlan)
[12:53:10] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org
[12:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1057.eqiad.wmnet with OS bullseye
[12:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:08] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1057.eqiad.wmnet with OS bullseye
[12:54:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29658 and previous config saved to /var/cache/conftool/dbconfig/20220613-125419-root.json
[12:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:56:25] <wikibugs>	 (CR) Volans: "reply inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[12:57:41] <wikibugs>	 (PS1) Jbond: Revert "idp: failover to preform reboot" [dns] - https://gerrit.wikimedia.org/r/804713
[12:57:47] <wikibugs>	 (PS2) Jbond: Revert "idp: failover to preform reboot" [dns] - https://gerrit.wikimedia.org/r/804713
[12:57:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[12:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1300).
[13:00:05] <jouncebot>	 kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <urbanecm>	 o/
[13:00:26] <kostajh>	 \o
[13:00:27] <urbanecm>	 hi kostajh, do you want to self-serve, or should I deploy?
[13:00:42] <kostajh>	 if it's convenient for you to do it, please do. Otherwise I don't mind
[13:00:49] <wikibugs>	 (CR) Muehlenhoff: "I don't think we even need specifically fall back, one IDP node is a good as the other. For all past maintenances the failed over server s" [dns] - https://gerrit.wikimedia.org/r/804713 (owner: Jbond)
[13:00:56] <Lucas_WMDE>	 o/
[13:00:57] <kostajh>	 urbanecm: ^
[13:01:07] <urbanecm>	 okay okay. I'll ping you once it's at the debug host kostajh :)
[13:01:27] <Lucas_WMDE>	 TheresNoTime: you had two patches in the morning window that apparently weren’t deployed, do you want to reschedule them? :)
[13:01:42] <kostajh>	 urbanecm: cheers
[13:02:35] <wikibugs>	 (Abandoned) Jbond: Revert "idp: failover to preform reboot" [dns] - https://gerrit.wikimedia.org/r/804713 (owner: Jbond)
[13:03:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet
[13:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29659 and previous config saved to /var/cache/conftool/dbconfig/20220613-130504-marostegui.json
[13:05:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[13:05:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[13:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:11] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:05:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29660 and previous config saved to /var/cache/conftool/dbconfig/20220613-130512-marostegui.json
[13:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:16] <wikibugs>	 (PS1) Jbond: hieradata:  netbox1001 to specify netbox1002 as the active server. [puppet] - https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452)
[13:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:06:16] <wikibugs>	 (CR) Urbanecm: [C: -1] "svg should be optimized with svgo" [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[13:06:42] <wikibugs>	 (CR) Urbanecm: [C: -1] "svg should be optimized with svgo" [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[13:07:27] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35834/console" [puppet] - https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[13:09:22] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[13:10:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet
[13:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:44] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] hieradata:  netbox1001 to specify netbox1002 as the active server. [puppet] - https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[13:12:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: reboots
[13:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet
[13:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: reboots
[13:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:07] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[13:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:09] <kostajh>	 urbanecm: no... selenium failed for Minerva :( 
[13:14:14] <urbanecm>	 :(
[13:14:15] <wikibugs>	 (PS4) Samtar: crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431)
[13:14:36] <wikibugs>	 (CR) CI reject: [V: -1] NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: Kosta Harlan)
[13:14:49] <urbanecm>	 kostajh: since it's selenium, perhaps let's re-run?
[13:15:01] <kostajh>	 urbanecm: yeah, it's a random failure. Or is force merge acceptable in this situation?
[13:15:20] <urbanecm>	 i try to avoid force merges as much as possible
[13:15:33] <wikibugs>	 (Merged) jenkins-bot: NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: Kosta Harlan)
[13:15:41] * urbanecm is confused
[13:15:42] <wikibugs>	 (PS4) Samtar: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431)
[13:15:48] <urbanecm>	 oh
[13:15:49] <wikibugs>	 (CR) Volans: "Did a first full pass." [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[13:15:53] <urbanecm>	 main failed, gate succeeded
[13:16:07] <kostajh>	 ha
[13:16:13] <kostajh>	 I didn't think that was possible with gerrit, but ok
[13:16:49] <urbanecm>	 kostajh: pulled to mwdebug1001. can you check please?
[13:17:14] <kostajh>	 urbanecm: checking
[13:17:14] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:18:33] <kostajh>	 urbanecm: looks good to me
[13:18:37] <urbanecm>	 syncing :)
[13:20:42] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:20:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
[13:20:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:13] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet
[13:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:09] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1057.eqiad.wmnet with reason: host reimage
[13:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:56] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/GrowthExperiments/modules/ext.growthExperiments.DataStore/NewcomerTasksStore.js: 67a5352b0bf9f6aa160cc93a42ca22a02aad883a: NewcomerTasksStore: update quality gate config when the task queue is set (T309768) (duration: 03m 41s)
[13:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:01] <stashbot>	 T309768: TypeError: Cannot read properties of undefined (reading 'dailyLimit') - https://phabricator.wikimedia.org/T309768
[13:23:07] <urbanecm>	 kostajh: and done. anything else?
[13:23:19] <kostajh>	 urbanecm: no, thank you very much!
[13:23:22] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet
[13:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:25] <urbanecm>	 happy to help!
[13:24:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[13:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet
[13:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1057.eqiad.wmnet with reason: host reimage
[13:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[13:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:16] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:27:04] <wikibugs>	 (CR) Jforrester: "❤️" [mediawiki-config] - https://gerrit.wikimedia.org/r/800680 (owner: Ladsgroup)
[13:27:07] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet
[13:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[13:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:12] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul) p:Triage→Medium
[13:29:30] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Papaul) p:Triage→Medium
[13:29:48] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul) p:Triage→Medium
[13:31:02] <wikibugs>	 (CR) Elukey: [C: +2] admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: Elukey)
[13:31:11] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Papaul) a:Papaul
[13:31:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[13:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on dse-k8s-worker[1001-1004].eqiad.wmnet with reason: reboots
[13:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-worker[1001-1004].eqiad.wmnet with reason: reboots
[13:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:29] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (elukey) Open→Resolved
[13:32:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29662 and previous config saved to /var/cache/conftool/dbconfig/20220613-133239-marostegui.json
[13:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:44] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:35:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:00] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:39:01] <wikibugs>	 (PS9) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[13:39:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:40:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:40:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (SLyngshede-WMF) p:Triage→Medium
[13:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:38] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet
[13:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:01] <wikibugs>	 (PS2) David Caro: Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789)
[13:43:28] <wikibugs>	 (PS3) David Caro: Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789)
[13:43:59] <wikibugs>	 (PS1) Klausman: ml-staging-codfw: Add override for cert names [deployment-charts] - https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195)
[13:44:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[13:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:53] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet
[13:45:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29663 and previous config saved to /var/cache/conftool/dbconfig/20220613-134744-marostegui.json
[13:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:54] <wikibugs>	 (CR) Ori: [C: +2] service::docker: refresh service when config file is changed [puppet] - https://gerrit.wikimedia.org/r/799420 (owner: Ori)
[13:48:59] <wikibugs>	 (PS4) Ori: service::docker: refresh service when config file is changed [puppet] - https://gerrit.wikimedia.org/r/799420
[13:49:38] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:50:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (dcaro) +1 on moving the cloudcephosd hosts, should have no problem as long as it's done one by one.
[13:50:44] <icinga-wm>	 PROBLEM - Check systemd state on karapace1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1057.eqiad.wmnet with OS bullseye
[13:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:08] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1057.eqiad.wmnet with OS bullseye completed: - ms-be1057 (**PASS**)   - Downtim...
[13:51:21] <wikibugs>	 (CR) MVernon: [C: +2] Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans)
[13:52:57] <wikibugs>	 (CR) MVernon: [C: +2] Configure AQS Cassandra hosts (codfw) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans)
[13:53:21] <wikibugs>	 (CR) Elukey: [C: +1] "The CI diff looks good to me!" [deployment-charts] - https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:53:26] <wikibugs>	 (CR) CDanis: [C: +1] puppetmaster: update private repo pre-commit to error un-staged (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803560 (owner: Jbond)
[13:54:10] <wikibugs>	 (CR) Klausman: [C: +2] ml-staging-codfw: Add override for cert names [deployment-charts] - https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:55:45] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dumpsdata1007.eqiad.wmnet
[13:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[13:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:12] <wikibugs>	 (Merged) jenkins-bot: ml-staging-codfw: Add override for cert names [deployment-charts] - https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:58:25] <icinga-wm>	 PROBLEM - Host ganeti6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:25] <icinga-wm>	 PROBLEM - Host ganeti6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:25] <icinga-wm>	 PROBLEM - Host ganeti6004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:25] <icinga-wm>	 PROBLEM - Host ganeti6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:47] <icinga-wm>	 PROBLEM - Host lvs6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:47] <icinga-wm>	 PROBLEM - Host lvs6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:47] <icinga-wm>	 PROBLEM - Host lvs6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:58:57] <icinga-wm>	 PROBLEM - Host asw1-b12-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:19] <sukhe>	 ^ expected?
[13:59:21] <icinga-wm>	 PROBLEM - Host scs-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:27] <icinga-wm>	 PROBLEM - Host dns6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:27] <icinga-wm>	 PROBLEM - Host dns6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:35] <icinga-wm>	 PROBLEM - Host asw1-b13-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:41] <icinga-wm>	 PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:59:55] <wikibugs>	 (CR) Jcrespo: [C: +1] wmnet: Failover m1 and m2 [dns] - https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui)
[14:00:03] <icinga-wm>	 PROBLEM - Host cp6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:00:09] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:00:13] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:33] <icinga-wm>	 PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 32, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:00:33] <icinga-wm>	 RECOVERY - Check systemd state on karapace1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:18] <wikibugs>	 (CR) Ayounsi: Netbox Ganeti sync: add groups support (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[14:01:31] <icinga-wm>	 PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[14:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:06] <wikibugs>	 (PS1) Btullis: Fail back Hive services to an-coord1001 [dns] - https://gerrit.wikimedia.org/r/805132 (https://phabricator.wikimedia.org/T309526)
[14:02:25] <icinga-wm>	 PROBLEM - Host cp6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:25] <icinga-wm>	 PROBLEM - Host cp6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:25] <icinga-wm>	 PROBLEM - Host cp6004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:25] <icinga-wm>	 PROBLEM - Host cp6005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:25] <icinga-wm>	 PROBLEM - Host cp6006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:26] <icinga-wm>	 PROBLEM - Host cp6007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:27] <icinga-wm>	 PROBLEM - Host cp6008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:27] <icinga-wm>	 PROBLEM - Host cp6009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:27] <icinga-wm>	 PROBLEM - Host cp6010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:28] <icinga-wm>	 PROBLEM - Host cp6012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:28] <icinga-wm>	 PROBLEM - Host cp6011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:29] <icinga-wm>	 PROBLEM - Host cp6013.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:29] <icinga-wm>	 PROBLEM - Host cp6014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:30] <icinga-wm>	 PROBLEM - Host cp6015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:30] <icinga-wm>	 PROBLEM - Host cp6016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:31] <icinga-wm>	 PROBLEM - Host cr2-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:31] <icinga-wm>	 PROBLEM - Host cr1-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29665 and previous config saved to /var/cache/conftool/dbconfig/20220613-140249-marostegui.json
[14:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:45] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fail back Hive services to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/805132 (https://phabricator.wikimedia.org/T309526) (owner: 10Btullis)
[14:03:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:03:54] <wikibugs>	 (03PS2) 10Ori: admin: steal Giuseppe's docker shortcuts [puppet] - 10https://gerrit.wikimedia.org/r/800122
[14:05:24] * TheresNoTime can't *hear* screaming... so that must be expected downtime :P
[14:05:41] <jbond>	 sukhe: this is related to some scheduled maintenance that shouldn't have caused an issue (cc XioNoX)
[14:05:50] <sukhe>	 https://phabricator.wikimedia.org/T310470 (from XioNoX)
[14:05:55] <sukhe>	 thanks jbond
[14:06:08] <sukhe>	 > Since 8am UTC one of the drmrs power feed is down. Only impact is the management router down (and thus the management network).
[14:06:13] <jbond>	 thanks for the link sukhe :)
[14:09:43] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:11:26] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[14:16:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui)
[14:17:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29666 and previous config saved to /var/cache/conftool/dbconfig/20220613-141754-marostegui.json
[14:17:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[14:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[14:17:58] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T310011)', diff saved to https://phabricator.wikimedia.org/P29667 and previous config saved to /var/cache/conftool/dbconfig/20220613-141802-marostegui.json
[14:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:46] <jayme>	 elukey|klausman is CertManagerCertNotReady on mlstaging expected? Let me know if you need help
[14:20:23] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:20:59] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:21:47] <icinga-wm>	 RECOVERY - Host ganeti6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms
[14:21:47] <icinga-wm>	 RECOVERY - Host ganeti6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[14:21:57] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:22:15] <icinga-wm>	 RECOVERY - Host lvs6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[14:22:15] <icinga-wm>	 RECOVERY - Host lvs6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms
[14:22:37] <klausman>	 jayme: yes, it's expected, I am setting up stuff there now
[14:22:37] <icinga-wm>	 RECOVERY - Host ganeti6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms
[14:22:41] <icinga-wm>	 RECOVERY - Host dns6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[14:22:43] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:22:43] <jayme>	 ack
[14:23:03] <icinga-wm>	 RECOVERY - Host cp6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[14:23:07] <icinga-wm>	 RECOVERY - Host cp6006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[14:23:35] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:24:59] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:25:37] <icinga-wm>	 RECOVERY - Host cp6014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[14:25:41] <icinga-wm>	 RECOVERY - Host cp6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[14:25:41] <icinga-wm>	 RECOVERY - Host cp6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms
[14:25:41] <icinga-wm>	 RECOVERY - Host cp6005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms
[14:25:41] <icinga-wm>	 RECOVERY - Host cp6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.03 ms
[14:25:43] <icinga-wm>	 RECOVERY - Host cp6009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[14:25:43] <icinga-wm>	 RECOVERY - Host cp6010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[14:25:43] <icinga-wm>	 RECOVERY - Host cp6012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms
[14:25:43] <icinga-wm>	 RECOVERY - Host cp6013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[14:25:43] <icinga-wm>	 RECOVERY - Host cp6015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[14:25:44] <icinga-wm>	 RECOVERY - Host cp6016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms
[14:25:45] <icinga-wm>	 PROBLEM - DPKG on aqs2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:25:45] <icinga-wm>	 PROBLEM - AQS root url on aqs2003 is CRITICAL: connect to address 10.192.0.211 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:25:46] <icinga-wm>	 PROBLEM - AQS root url on aqs2010 is CRITICAL: connect to address 10.192.48.187 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:27:35] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:27:39] <TheresNoTime>	 urbanecm: ref https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=1988703&diffmode=source, do you think I could get them on the UTC late deployment?
[14:28:05] <urbanecm>	 TheresNoTime: assuming you addressed my -1 from a few hours ago, why not? :)
[14:28:19] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:28:19] <TheresNoTime>	 yup, run them through svgo ^^
[14:28:23] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.0.214:9042 on aqs2001 is CRITICAL: connect to address 10.192.0.214 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:28:23] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.0.220:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.220 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:28:23] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:28:31] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:29:27] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:30:09] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:30:49] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:31:03] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.183:9042 on aqs2006 is CRITICAL: connect to address 10.192.16.183 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:31:03] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.0.220:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:31:03] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.0.214:7001 on aqs2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:31:03] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.186:7001 on aqs2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:32:33] <elukey>	 btullis: o/ all downtimes expired?
[14:32:38] <urbanecm>	 TheresNoTime: great. see you in a few hours in that case :)
[14:32:48] <elukey>	 (for AQS I mean, the codfw nodes are not up yet right?)
[14:33:31] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:33:31] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:33:41] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.0.218:9042 on aqs2003 is CRITICAL: connect to address 10.192.0.218 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:33:41] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.174:9042 on aqs2005 is CRITICAL: connect to address 10.192.16.174 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:33:41] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.183:7001 on aqs2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:33:43] <icinga-wm>	 PROBLEM - cassandra-a service on aqs2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:34:09] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:34:22] <wikibugs>	 (03PS1) 10Klausman: Add inference-staging service IP (10.2.1.58) [puppet] - 10https://gerrit.wikimedia.org/r/805134
[14:34:32] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:34:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:51] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:35:59] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:36:13] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:36:19] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.0.215:9042 on aqs2001 is CRITICAL: connect to address 10.192.0.215 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:36:19] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.0.221:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.221 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:36:19] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:36:19] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.174:7001 on aqs2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:36:19] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.0.218:7001 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:37:37] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:38:21] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:39] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:38:45] <wikibugs>	 (03PS2) 10Klausman: Add nference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134
[14:38:53] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.185:9042 on aqs2006 is CRITICAL: connect to address 10.192.16.185 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:38:53] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.48.194:9042 on aqs2010 is CRITICAL: connect to address 10.192.48.194 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:38:55] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.0.215:7001 on aqs2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:38:55] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.0.221:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:38:55] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.187:7001 on aqs2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:39:04] <wikibugs>	 (03PS5) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560
[14:39:17] <wikibugs>	 (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond)
[14:39:24] <wikibugs>	 (03PS1) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195)
[14:39:49] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) @Dzahn Yeah, sure. Let me close this now. Thanks.
[14:40:02] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) 05Open→03Resolved
[14:40:08] <wikibugs>	 (03PS3) 10Klausman: Add inference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134
[14:41:03] <wikibugs>	 (03PS4) 10Klausman: Add inference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134
[14:41:06] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484)
[14:41:33] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.0.219:9042 on aqs2003 is CRITICAL: connect to address 10.192.0.219 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:41:33] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.179:9042 on aqs2005 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:41:35] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.185:7001 on aqs2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:41:35] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.194:7001 on aqs2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:41:35] <icinga-wm>	 PROBLEM - cassandra-b service on aqs2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:42:03] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01119 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:42:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) (owner: 10Marostegui)
[14:42:35] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:42:53] <marostegui>	 !log Failover m1 and m2 to a different proxy T310484
[14:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:56] <stashbot>	 T310484: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484
[14:43:16] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) m1 and m2 failed over.
[14:43:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T310011)', diff saved to https://phabricator.wikimedia.org/P29668 and previous config saved to /var/cache/conftool/dbconfig/20220613-144337-marostegui.json
[14:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:41] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:43:48] <volans>	 Emperor: I think the puppet failures alert is related to the aqs cassandra issues above, puppet is failing on them
[14:43:57] <volans>	 Unable to locate package cassandra-tools-wmf
[14:44:00] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471)
[14:44:02] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471)
[14:44:09] <icinga-wm>	 PROBLEM - AQS root url on aqs2005 is CRITICAL: connect to address 10.192.16.42 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:44:09] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.0.219:7001 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:44:09] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.179:7001 on aqs2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:44:14] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui)
[14:44:41] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:44:41] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:44:59] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:45:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui)
[14:45:31] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:46:43] <icinga-wm>	 PROBLEM - AQS root url on aqs2007 is CRITICAL: connect to address 10.192.16.169 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:46:43] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.48.195:9042 on aqs2010 is CRITICAL: connect to address 10.192.48.195 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:47:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:47:25] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:47:59] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:48:45] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:49:18] <wikibugs>	 (03CR) 10Elukey: "Change looks good, for consistency I recall in the past that people asked to allocate eqiad as well even if not used." [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[14:49:21] <icinga-wm>	 PROBLEM - AQS root url on aqs2001 is CRITICAL: connect to address 10.192.0.111 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:49:23] <icinga-wm>	 PROBLEM - AQS root url on aqs2004 is CRITICAL: connect to address 10.192.0.212 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:49:23] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.48.195:7001 on aqs2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:52:08] <icinga-wm>	 RECOVERY - Host ganeti6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms
[14:52:12] <icinga-wm>	 RECOVERY - Host lvs6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms
[14:52:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Looks good, the change should be a no-op but let's have also another pair of eyes to confirm!" [puppet] - 10https://gerrit.wikimedia.org/r/805134 (owner: 10Klausman)
[14:52:37] <wikibugs>	 (03PS2) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195)
[14:53:22] <wikibugs>	 (03CR) 10Klausman: Add inference-staging service IP (10.2.1.58) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[14:53:44] <wikibugs>	 (03PS3) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195)
[14:54:15] <icinga-wm>	 RECOVERY - Host cp6007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[14:54:16] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:32] <icinga-wm>	 RECOVERY - Host cp6008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms
[14:54:32] <icinga-wm>	 RECOVERY - Host cp6011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms
[14:55:26] <icinga-wm>	 PROBLEM - AQS root url on aqs2012 is CRITICAL: connect to address 10.192.48.189 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[14:56:32] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:58:00] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.48.192:9042 on aqs2009 is CRITICAL: connect to address 10.192.48.192 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:58:38] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29669 and previous config saved to /var/cache/conftool/dbconfig/20220613-145842-marostegui.json
[14:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:08] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:00:37] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet
[15:00:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:42] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.192:7001 on aqs2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:01:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:02:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] only page for NEL after 5 minutes [alerts] - https://gerrit.wikimedia.org/r/804640 (owner: CDanis)
[15:03:16] <icinga-wm>	 PROBLEM - AQS root url on aqs2011 is CRITICAL: connect to address 10.192.48.188 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:03:42] <icinga-wm>	 RECOVERY - Host asw1-b12-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms
[15:04:04] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:04:05] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:04:12] <wikibugs>	 (CR) CDanis: [C: +2] only page for NEL after 5 minutes [alerts] - https://gerrit.wikimedia.org/r/804640 (owner: CDanis)
[15:04:15] <icinga-wm>	 RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:04:30] <icinga-wm>	 RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 88.13 ms
[15:04:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:04:42] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet
[15:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:50] <icinga-wm>	 RECOVERY - Host asw1-b13-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.41 ms
[15:04:55] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:05:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:05:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:05:44] <icinga-wm>	 RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.42 ms
[15:05:44] <icinga-wm>	 RECOVERY - Host cr2-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms
[15:06:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:28] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6013 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:06:34] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:06:50] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:06:54] <wikibugs>	 (PS1) Muehlenhoff: acme_chief: Remove old buster IDP hosts [puppet] - https://gerrit.wikimedia.org/r/805140 (https://phabricator.wikimedia.org/T308214)
[15:07:29] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (MoritzMuehlenhoff)
[15:08:15] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:08:18] <icinga-wm>	 RECOVERY - Host scs-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.60 ms
[15:08:20] <icinga-wm>	 RECOVERY - Host dns6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 99.18 ms
[15:08:20] <icinga-wm>	 RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.81 ms
[15:08:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro)
[15:09:24] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:10:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Thank you! LGTM" [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[15:10:54] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[15:11:54] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: Btullis)
[15:12:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet
[15:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:35] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:13:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29670 and previous config saved to /var/cache/conftool/dbconfig/20220613-151347-marostegui.json
[15:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:57] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Eevans) We may also need to install Buster on these, instead of Bullseye (see: https://phabricator.wikimedia.org/T307801#7999033).
[15:14:11] <wikibugs>	 (PS1) David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930)
[15:14:33] <wikibugs>	 (CR) David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro)
[15:15:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:15:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6016 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:15:52] <wikibugs>	 (PS1) Muehlenhoff: Remove secteam-users group [puppet] - https://gerrit.wikimedia.org/r/805167
[15:16:02] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6015 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:16:32] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:17:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet
[15:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:58] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for johnben [puppet] - https://gerrit.wikimedia.org/r/805168
[15:19:02] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:19:50] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:20:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for johnben [puppet] - https://gerrit.wikimedia.org/r/805168 (owner: Muehlenhoff)
[15:21:28] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:21:32] <icinga-wm>	 PROBLEM - Host thumbor2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:41] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:23:03] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dns6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:23:43] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dns6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:24:35] <icinga-wm>	 PROBLEM - DNS on cp6016.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.136.128.30 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:25:05] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:25:19] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Eevans)
[15:25:28] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] multiversion: Simplify code and improve documentation (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/785308 (owner: Krinkle)
[15:26:16] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (Eevans) @Cmjohnson It doesn't look like any of the OS installations succeeded (yet), is it too late to ask for Buster instead?
[15:26:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:26:37] <icinga-wm>	 RECOVERY - Host thumbor2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms
[15:27:59] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:28:11] <icinga-wm>	 RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[15:28:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] noc: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:28:45] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:28:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T310011)', diff saved to https://phabricator.wikimedia.org/P29671 and previous config saved to /var/cache/conftool/dbconfig/20220613-152852-marostegui.json
[15:28:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[15:28:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[15:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:58] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[15:28:59] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:29:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29672 and previous config saved to /var/cache/conftool/dbconfig/20220613-152900-marostegui.json
[15:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:03] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:55] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:29:58] <wikibugs>	 (PS2) David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930)
[15:30:04] <jouncebot>	 jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1530).
[15:30:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:31:07] <wikibugs>	 (CR) David Caro: "Tested on codfw for validity, not 100% sure it will fix the issues. I might create another check that does an actual request every 5 min t" [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[15:31:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:31:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS buster
[15:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:23] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp6005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:34:48] <wikibugs>	 (CR) SBassett: [C: +1] "Yeah, as long as nothing else uses this, it should be removed." [puppet] - https://gerrit.wikimedia.org/r/805167 (owner: Muehlenhoff)
[15:35:40] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546)
[15:36:21] <wikibugs>	 (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:37:52] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:39:29] <icinga-wm>	 PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:39:58] <wikibugs>	 (PS3) David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930)
[15:40:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:40] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:44:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:44:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:31] <icinga-wm>	 RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[15:47:32] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage
[15:47:34] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:805173| Bumping portals to master (T128546)]] (duration: 03m 35s)
[15:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:39] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:48:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:27] <wikibugs>	 (PS1) Muehlenhoff: Remove tendril leftover [puppet] - https://gerrit.wikimedia.org/r/805176
[15:50:14] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage
[15:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:02] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:805173| Bumping portals to master (T128546)]] (duration: 03m 27s)
[15:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:53:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:56:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:58:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:58:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:08] <wikibugs>	 (PS1) Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683)
[16:00:30] <wikibugs>	 (PS7) BCornwall: Traffic: Add PyBal BGP sessions [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723)
[16:00:55] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) unfortunately this is purchase date 2016-12-12  .. so ...probably can't get it fixed
[16:02:09] <wikibugs>	 ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (RobH) all drmrs hosts have gone green in icinga on ipmi checks and mgmt dns (both went red from power removal)
[16:02:24] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:02:33] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:03:02] <wikibugs>	 (PS1) RobH: Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - https://gerrit.wikimedia.org/r/805156
[16:03:13] <wikibugs>	 (PS2) RobH: Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - https://gerrit.wikimedia.org/r/805156
[16:03:34] <wikibugs>	 (CR) BCornwall: [C: +2] Traffic: Add PyBal BGP sessions [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[16:03:57] <wikibugs>	 (CR) RobH: [C: +2] Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - https://gerrit.wikimedia.org/r/805156 (owner: RobH)
[16:04:09] <wikibugs>	 (PS1) Zabe: coal: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805180 (https://phabricator.wikimedia.org/T308013)
[16:04:12] <wikibugs>	 (PS1) Zabe: cmd_checklist_runner: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805181 (https://phabricator.wikimedia.org/T308013)
[16:04:15] <icinga-wm>	 RECOVERY - DNS on cp6016.mgmt is OK: DNS OK: 0.017 seconds response time. cp6016.mgmt.drmrs.wmnet returns 10.136.128.30 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:04:18] <wikibugs>	 (PS1) Zabe: cloudnfs: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805182 (https://phabricator.wikimedia.org/T308013)
[16:04:20] <wikibugs>	 (PS1) Zabe: cloudlib: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805183 (https://phabricator.wikimedia.org/T308013)
[16:04:22] <wikibugs>	 (PS2) David Caro: nova: add user to libvirt-qemu [puppet] - https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342)
[16:04:24] <wikibugs>	 (PS1) Zabe: cinderutils: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805184 (https://phabricator.wikimedia.org/T308013)
[16:04:26] <wikibugs>	 (PS1) Zabe: cfssl: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805185 (https://phabricator.wikimedia.org/T308013)
[16:04:28] <wikibugs>	 (PS1) Zabe: cergen: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805186 (https://phabricator.wikimedia.org/T308013)
[16:04:30] <wikibugs>	 (PS1) Zabe: celery: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805187 (https://phabricator.wikimedia.org/T308013)
[16:04:32] <wikibugs>	 (PS1) Zabe: cacheproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805188 (https://phabricator.wikimedia.org/T308013)
[16:04:34] <wikibugs>	 (PS1) Zabe: burrow: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013)
[16:04:36] <wikibugs>	 (PS1) Zabe: bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013)
[16:04:38] <wikibugs>	 (PS1) Zabe: bigtop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013)
[16:04:40] <wikibugs>	 (PS1) Zabe: backy2: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805192 (https://phabricator.wikimedia.org/T308013)
[16:04:42] <wikibugs>	 (PS1) Zabe: atskafka: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805193 (https://phabricator.wikimedia.org/T308013)
[16:04:44] <wikibugs>	 (PS1) Zabe: aqs: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805194 (https://phabricator.wikimedia.org/T308013)
[16:04:46] <wikibugs>	 (PS1) Zabe: apereo_cas: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805195 (https://phabricator.wikimedia.org/T308013)
[16:04:48] <wikibugs>	 (PS1) Zabe: alternatives: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805196 (https://phabricator.wikimedia.org/T308013)
[16:04:50] <wikibugs>	 (PS1) Zabe: airflow: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805197 (https://phabricator.wikimedia.org/T308013)
[16:05:05] <wikibugs>	 (CR) David Caro: [C: +2] openstack,nova-api-metadata: add harakiri timeout [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[16:05:18] <wikibugs>	 (CR) David Caro: [C: +2] openstack,nova-api-metadata: add harakiri timeout [puppet] - https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[16:06:17] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:06:19] <wikibugs>	 (Merged) jenkins-bot: Traffic: Add PyBal BGP sessions [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[16:06:43] <wikibugs>	 (PS5) BCornwall: Traffic: Add alert for Varnish child restart [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723)
[16:07:33] <wikibugs>	 (PS2) Zabe: cmd_checklist_runner: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805181 (https://phabricator.wikimedia.org/T308013)
[16:10:16] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[16:10:34] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:11:04] <wikibugs>	 (PS2) Zabe: burrow: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013)
[16:11:43] <wikibugs>	 (CR) David Caro: [C: +2] nova: add user to libvirt-qemu [puppet] - https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342) (owner: David Caro)
[16:12:18] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:12:25] <icinga-wm>	 PROBLEM - IPMI Sensor Status on thumbor2004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:14:37] <icinga-wm>	 RECOVERY - DPKG on aqs2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[16:15:51] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2001.codfw.wmnet with OS buster
[16:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:41] <wikibugs>	 (CR) CI reject: [V: -1] bsection: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[16:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:20:45] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.0.214:7001 on aqs2001 is OK: SSL OK - Certificate aqs2001-a valid until 2024-06-07 14:43:29 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:20:45] <icinga-wm>	 RECOVERY - cassandra-a service on aqs2001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:20:57] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[16:21:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:24:02] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Dzahn) A1: serviceops: gitlab2002 is still in state "in setup". While we were going to change that we will hold back until this is done.
[16:25:05] <wikibugs>	 (PS2) Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683)
[16:26:19] <wikibugs>	 (CR) Marostegui: [C: +1] Remove tendril leftover [puppet] - https://gerrit.wikimedia.org/r/805176 (owner: Muehlenhoff)
[16:28:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS buster
[16:28:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/805180 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[16:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:57] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.0.214:9042 on aqs2001 is OK: TCP OK - 0.032 second response time on 10.192.0.214 port 9042 https://phabricator.wikimedia.org/T93886
[16:29:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29673 and previous config saved to /var/cache/conftool/dbconfig/20220613-162914-marostegui.json
[16:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:20] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[16:31:59] <marostegui>	 !log Reboot all codfw parsercache hosts T310485
[16:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:09] <logmsgbot>	 !log dancy@deploy1002 prep aborted:  (duration: 00m 26s)
[16:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:15] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3001.esams.wmnet with OS bullseye
[16:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:20] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye
[16:32:32] <marostegui>	 !log dbmaint x2@eqiad upgrade and reboot all x2 db hosts T310485
[16:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:41] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (hnowlan) Thanks for the heads-up @Dzahn. We don't have replacement hardware budgeted because of the planned move to k8s, but we're looking into stopgap options
[16:34:41] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (WDoranWMF) Thanks for this @Dzahn. #platform_engineering are reaching out to #dc-ops to see what our options are.
[16:35:00] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (WDoranWMF) p:Medium→High
[16:35:04] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul) @Dzahn we are not doing rack A1 until maybe the end of the year because we don't have the PDUs yet for that rack; same for A8
[16:36:17] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.0.215:9042 on aqs2001 is OK: TCP OK - 0.032 second response time on 10.192.0.215 port 9042 https://phabricator.wikimedia.org/T93886
[16:37:08] <wikibugs>	 (CR) JMeybohm: [C: +1] "I don't agree on this being a no-op as "inference-staging: [codfw]" is new. But as that's what the commit message said, +1 from me :)" [puppet] - https://gerrit.wikimedia.org/r/805134 (owner: Klausman)
[16:37:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster
[16:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster
[16:38:10] <wikibugs>	 (CR) Ssingh: "recheck" [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[16:38:26] <logmsgbot>	 !log dancy@deploy1002 prep aborted:  (duration: 06m 12s)
[16:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:41] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.0.215:7001 on aqs2001 is OK: SSL OK - Certificate aqs2001-b valid until 2024-06-07 14:43:32 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:39:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (CMyrick-WMF)
[16:40:25] <logmsgbot>	 !log dancy@deploy1002 prep aborted:  (duration: 01m 40s)
[16:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:40] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (CMyrick-WMF)
[16:41:05] <icinga-wm>	 RECOVERY - cassandra-b service on aqs2001 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:42:15] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[16:43:11] <icinga-wm>	 RECOVERY - IPMI Sensor Status on thumbor2004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:43:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:44:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29674 and previous config saved to /var/cache/conftool/dbconfig/20220613-164419-marostegui.json
[16:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:49] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage
[16:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:11] <wikibugs>	 (PS1) David Caro: openstack: set the nova user groups on virts only [puppet] - https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342)
[16:47:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage
[16:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:49:02] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage
[16:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster
[16:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[16:49:53] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:50:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:50:26] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35836/console" [puppet] - https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: David Caro)
[16:50:50] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3001.esams.wmnet with reason: host reimage
[16:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:13] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (brion)
[16:53:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1145.eqiad.wmnet with OS buster
[16:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:42] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster
[16:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec...
[16:54:01] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage
[16:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[16:54:28] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] openstack: set the nova user groups on virts only [puppet] - https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: David Caro)
[16:54:31] <wikibugs>	 (CR) David Caro: [V: +2 C: +2] openstack: set the nova user groups on virts only [puppet] - https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: David Caro)
[16:55:04] <wikibugs>	 ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (RobH) all green and maint window end announce sent by drmrs
[16:55:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:41] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:55:42] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3001.esams.wmnet with reason: host reimage
[16:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1146.eqiad.wmnet with OS buster
[16:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:26] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS buster
[16:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:49] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet
[16:58:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:55] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-netbox.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[16:59:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29675 and previous config saved to /var/cache/conftool/dbconfig/20220613-165925-marostegui.json
[16:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec...
[16:59:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster
[17:00:05] <jouncebot>	 ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1700).
[17:02:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] wmcs: relabel alerts from wmcs cluster with wmcs team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro)
[17:03:05] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet
[17:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS buster
[17:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster
[17:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:05:43] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1144.eqiad.wmnet with OS buster
[17:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster comp...
[17:07:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:09:57] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage
[17:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:49] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3001.esams.wmnet with OS bullseye
[17:11:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:52] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye completed: - ganeti3001 (**PASS**)   - Downtimed on Icinga/Ale...
[17:12:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:13:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage
[17:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:30] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (RobH) a:RobH→MoritzMuehlenhoff ganeti3001 firmware updates bios  2.2.11 to 2.14.2 nic  21.40.22.20 to 21.85.21.92 idrac  3.34.34.34 to 5.10.10.00  Moritz,  ganeti3001 firmware updated an...
[17:14:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29676 and previous config saved to /var/cache/conftool/dbconfig/20220613-171430-marostegui.json
[17:14:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[17:14:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[17:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:35] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[17:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29677 and previous config saved to /var/cache/conftool/dbconfig/20220613-171438-marostegui.json
[17:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:10] <wikibugs>	 (PS1) Clare Ming: Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683)
[17:16:10] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage
[17:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:37] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35837/console" [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro)
[17:18:49] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Papaul) Open→Resolved Nothing in the IDRAC log showing any HW issues. I did some firmware upgrade  Bios from  2.3.4 to 2.13 IDRAC from  2.63.60.61 to 2.83.83  maybe with the new firmware we can see somet...
[17:18:51] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster
[17:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster
[17:19:14] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage
[17:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:17] <icinga-wm>	 RECOVERY - puppet last run on thumbor2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:21:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:22:51] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster
[17:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[17:24:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1147.eqiad.wmnet with OS buster
[17:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster comp...
[17:26:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:26:55] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor2004.codfw.wmnet
[17:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2002.codfw.wmnet with OS buster
[17:29:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:56] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster
[17:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[17:30:18] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage
[17:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:16] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1148.eqiad.wmnet with OS buster
[17:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster comp...
[17:33:26] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage
[17:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:24] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage
[17:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:03] <wikibugs>	 (CR) Jdlrobson: [C: +1] Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming)
[17:37:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage
[17:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage
[17:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage
[17:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:07] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1143.eqiad.wmnet with OS buster
[17:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster comp...
[17:49:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1145.eqiad.wmnet with OS buster
[17:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster comp...
[17:55:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29678 and previous config saved to /var/cache/conftool/dbconfig/20220613-175500-marostegui.json
[17:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:09] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[17:55:29] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1146.eqiad.wmnet with OS buster
[17:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster comp...
[18:02:36] <wikibugs>	 (CR) BCornwall: [C: +2] Traffic: Add alert for Varnish child restart [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[18:05:55] <wikibugs>	 (PS1) Muehlenhoff: Add Brion to contributors [puppet] - https://gerrit.wikimedia.org/r/805214 (https://phabricator.wikimedia.org/T308013)
[18:07:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Brion to contributors [puppet] - https://gerrit.wikimedia.org/r/805214 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[18:10:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29679 and previous config saved to /var/cache/conftool/dbconfig/20220613-181005-marostegui.json
[18:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:42] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet
[18:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29680 and previous config saved to /var/cache/conftool/dbconfig/20220613-182510-marostegui.json
[18:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:03] <wikibugs>	 SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (WDoranWMF) thank you @Papaul.  @hnowlan would should review the other machine's state.
[18:27:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) Open→Resolved Finally resolved this, had some issues with network ports not being correct
[18:40:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29681 and previous config saved to /var/cache/conftool/dbconfig/20220613-184015-marostegui.json
[18:40:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:40:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:21] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[18:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:51] <icinga-wm>	 PROBLEM - Host mc1049 is DOWN: PING CRITICAL - Packet loss = 100%
[18:46:35] <rzl>	 ^ reboot in progress, weird that it didn't get downtimed? cc arnoldokoth 
[18:48:05] <arnoldokoth>	 Yeah, looks like it hasn't come back yet.
[18:48:43] <rzl>	 oh, so the downtime just expired before it came back up
[18:48:56] <rzl>	 that's a little weird in itself, it shouldn't take that long
[18:49:26] <mutante>	 maybe it's running an fsck?
[18:49:48] <mutante>	 checked console yet, arnoldokoth ?
[18:50:02] <arnoldokoth>	 Checking.
[18:51:11] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:52:30] <wikibugs>	 SRE, Thumbor, Traffic: Thumbor URLs are too permissive - https://phabricator.wikimedia.org/T310528 (TheDJ) This shouldn't be a problem as long as MediaWiki only generates url fragments that are lowercase (which is what it should be doing). In general, thumbor is a tad more permissive than MediaWiki (...
[18:55:20] <mutante>	 !log gitlab2002 - rebooting 
[18:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:01] <icinga-wm>	 RECOVERY - Host mc1049 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[19:00:55] <rzl>	 \o/
[19:01:10] <rzl>	 arnoldokoth: was that you, or did it just come back on its own?
[19:01:59] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet
[19:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[19:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[19:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29682 and previous config saved to /var/cache/conftool/dbconfig/20220613-190314-marostegui.json
[19:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:19] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[19:04:09] <mutante>	 !log gitlab2003 - rebooting
[19:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:48] <arnoldokoth>	 rzl: i power cycled it from the mgmt console. no idea if that's what fixed it.
[19:05:00] <rzl>	 cool, sounds like it to me
[19:05:22] <mutante>	 fwiw, this has happened to me in the past.. every once in a while
[19:05:59] <mutante>	 like "cookbook asks for reboot but it does not come back and then it seems alright if you powercycle"
[19:06:11] <mutante>	 afair it was happening with 1 out of 20 mw appservers
[19:07:07] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:07:58] <mutante>	 !log gerrit2002 - rebooting
[19:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:01] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:09:19] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (KFrancis) @CDanis There is not an NDA on file.  Please provide me with Ricardo Baeza-Yates postal address and I will put the agreement together.  Please sen...
[19:10:35] <wikibugs>	 (PS1) Marcelo1251: Point Wikimedia Enterprise HTML Dumps to trial API features [puppet] - https://gerrit.wikimedia.org/r/805223 (https://phabricator.wikimedia.org/T310075)
[19:11:32] <mutante>	 !log etherpad - minimal downtime - rebooting etherpad1003
[19:11:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:06] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[19:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:09] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[19:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:27] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:16:27] <urandom>	 if there is anyone around able to reimage hosts, who is also bored and looking for something to do, I can help :)
[19:16:44] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (leila) @KFrancis I'm not sure if this will make a difference in your recommendation, however, please be aware that Ricardo has signed a contract with WMF an...
[19:28:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29683 and previous config saved to /var/cache/conftool/dbconfig/20220613-192851-marostegui.json
[19:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:57] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[19:29:17] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (KFrancis) Hi all, the agreement is out for signatures.  Thanks!
[19:38:16] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (KFrancis) @leila Thank you!  I didn't see Ricardo's name on the contractor list at first, but I checked again and it's there.  Thank you for bringing this t...
[19:43:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29684 and previous config saved to /var/cache/conftool/dbconfig/20220613-194356-marostegui.json
[19:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29685 and previous config saved to /var/cache/conftool/dbconfig/20220613-195902-marostegui.json
[19:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T2000).
[20:00:04] <jouncebot>	 TheresNoTime and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] * TheresNoTime is here
[20:00:16] * urbanecm waves
[20:00:19] <cjming>	 o/ 
[20:00:23] <cjming>	 i can deploy
[20:00:27] <urbanecm>	 go ahead :)
[20:00:51] <cjming>	 urbanecm: do those logo patches lgtu?
[20:01:11] <urbanecm>	 looks i forgot to +1
[20:01:12] <urbanecm>	 yes :)
[20:01:22] <wikibugs>	 (CR) Urbanecm: [C: +1] crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:01:26] <wikibugs>	 (CR) Urbanecm: [C: +1] ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:01:45] <wikibugs>	 (CR) Clare Ming: [C: +2] crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:02:41] <wikibugs>	 (Merged) jenkins-bot: crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:03:56] <cjming>	 TheresNoTime: can you check mwdebug1002 for your 1st patch?
[20:04:01] <TheresNoTime>	 looking
[20:04:41] <TheresNoTime>	 cjming: lgtm :)
[20:04:49] <cjming>	 great - syncing
[20:06:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:06:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:15] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:21] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-crh.svg: Config: [[gerrit:800856|crhwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 16s)
[20:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:24] <stashbot>	 T309431: Change Wikimedia Wordmark for crhwiki and ugwiki - https://phabricator.wikimedia.org/T309431
[20:08:47] <wikibugs>	 (PS5) Clare Ming: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:09:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:09:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:29] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:11:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:11:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:57] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800856|crhwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 27s)
[20:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:09] <wikibugs>	 (CR) Clare Ming: [C: +2] ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:12:23] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet
[20:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:49] <wikibugs>	 (Merged) jenkins-bot: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar)
[20:14:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29686 and previous config saved to /var/cache/conftool/dbconfig/20220613-201407-marostegui.json
[20:14:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[20:14:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[20:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:12] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[20:14:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[20:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[20:14:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:20] <cjming>	 TheresNoTime: your 1st patch should be live -- 2nd patch is up on mwdebug1002 - can you test?
[20:14:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29687 and previous config saved to /var/cache/conftool/dbconfig/20220613-201420-marostegui.json
[20:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:27] <TheresNoTime>	 looking
[20:15:00] <TheresNoTime>	 cjming: yup, lgtm as well :)
[20:15:11] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:15:12] <cjming>	 cool - syncing
[20:15:19] <wikibugs>	 ops-eqiad, DC-Ops: Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs) p:Triage→Low
[20:15:21] <wikibugs>	 ops-eqiad, DC-Ops: Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) p:Triage→Low
[20:15:57] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Kanban): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs)
[20:16:30] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs)
[20:16:34] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs)
[20:16:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:16:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:28] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs)
[20:17:34] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs)
[20:17:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:17:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:25] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Recable cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs)
[20:18:32] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs)
[20:18:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:57] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-ug.svg: Config: [[gerrit:800857|ugwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 36s)
[20:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:02] <stashbot>	 T309431: Change Wikimedia Wordmark for crhwiki and ugwiki - https://phabricator.wikimedia.org/T309431
[20:19:05] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet
[20:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:29] <wikibugs>	 SRE, ops-eqiad, Cloud-Services, DC-Ops, and 2 others: move cloudcephmon1002.eqiad.wmnet from rack B4 to rack D5 - https://phabricator.wikimedia.org/T304096 (nskaggs)
[20:19:46] <TheresNoTime>	 thank you for the deploy cjming :-)
[20:19:57] <cjming>	 you're welcome!
[20:20:08] <cjming>	 2nd patch should be live here shortly
[20:20:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (nskaggs) Filed T310546 and T310547 to free ports and allow cloudnet1005 and cloudnet1006 connections to cloudsw1*.
[20:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:22:34] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800857|ugwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 30s)
[20:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:42] <wikibugs>	 (CR) Clare Ming: [C: +2] Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming)
[20:23:33] <wikibugs>	 (Merged) jenkins-bot: Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming)
[20:27:35] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:805206|Disable TOC A/B test for beta cluster (T309683)]] (duration: 03m 29s)
[20:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:40] <stashbot>	 T309683: Turn off table of contents A/B test - https://phabricator.wikimedia.org/T309683
[20:28:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:03] <cjming>	 !log end of UTC late backport window
[20:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:29:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:37] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[20:55:41] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T2100).
[21:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:06:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29688 and previous config saved to /var/cache/conftool/dbconfig/20220613-210603-marostegui.json
[21:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:10] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[21:21:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29689 and previous config saved to /var/cache/conftool/dbconfig/20220613-212108-marostegui.json
[21:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:39] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Dzahn) ah ACK, ok, in that case we will just move forward as planned. Thanks Papaul
[21:24:25] <wikibugs>	 (PS3) Ryan Kemper: Revert "elastic: increase recovery time" [cookbooks] - https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: Bking)
[21:25:05] <wikibugs>	 (PS3) Ryan Kemper: elastic: remove decommissioned hosts in beta [puppet] - https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[21:25:11] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[21:35:09] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:36:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29690 and previous config saved to /var/cache/conftool/dbconfig/20220613-213613-marostegui.json
[21:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:15] <icinga-wm>	 PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:40:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:44:37] <mutante>	 !log gitlab-runner1001 - pause from accepting jobs - rebooting
[21:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:15] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 159, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:48:40] <mutante>	 !log gitlab-runner* - sequentially pausing, rebooting, resuming one by one
[21:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (XCollazo-WMF)
[21:49:53] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:51:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29691 and previous config saved to /var/cache/conftool/dbconfig/20220613-215118-marostegui.json
[21:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:23] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[21:51:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[21:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[21:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: Maintenance
[21:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 12 hosts with reason: Maintenance
[21:51:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:41] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:53:09] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:53:49] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:54:31] <icinga-wm>	 PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100%
[21:54:34] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (Aklapper) Resolved→Open Not yet fully done per https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924287
[21:54:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:55:32] <icinga-wm>	 ACKNOWLEDGEMENT - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn maintenance
[21:56:06] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner[1001-1004].eqiad.wmnet with reason: maintenance reboot
[21:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:12] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner[1001-1004].eqiad.wmnet with reason: maintenance reboot
[21:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:37] <icinga-wm>	 RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[22:00:43] <icinga-wm>	 RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 23, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:01:51] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 25, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:10:28] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner[2001-2004].codfw.wmnet with reason: maintenance reboot
[22:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:35] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner[2001-2004].codfw.wmnet with reason: maintenance reboot
[22:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:38] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (WDoranWMF) As Xabriel's manager, I approve.
[22:15:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[22:15:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[22:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29692 and previous config saved to /var/cache/conftool/dbconfig/20220613-221522-marostegui.json
[22:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:27] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[22:17:38] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:18:17] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:22:38] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (KirkLU) @SLyngshede-WMF Thank you for doing all these for us.
[22:25:58] <wikibugs>	 (PS1) BCornwall: Traffic: add varnishkafka delivery error alarms [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723)
[22:30:15] <wikibugs>	 SRE, Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (BCornwall) @ssingh @KOfori  Is there a need/desire to have these three instances around? If so, is there any objection to following the above and termin...
[22:30:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (Dzahn)
[22:31:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (Dzahn)
[22:32:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (Dzahn) Confirming @XCollazo-WMF exists and was introduced in SRE meeting today :) welcome to WMF. Confirmed signature and checked all other boxes. Just one is open for clinic duty.
[22:33:57] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (Dzahn) done! added @XCollazo-WMF to https://phabricator.wikimedia.org/tag/wmf-nda/
[22:36:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 5.830 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:37:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.563 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:37:52] <wikibugs>	 SRE, Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (Dzahn) I also saw certificate errors pop up in a different project that uses a local puppetmaster. And we felt like we had not touched anything. Did not...
[22:38:56] <wikibugs>	 (PS1) BCornwall: Traffic: Reorganize into more, smaller files [alerts] - https://gerrit.wikimedia.org/r/805241
[22:40:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29693 and previous config saved to /var/cache/conftool/dbconfig/20220613-224014-marostegui.json
[22:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:20] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[22:49:19] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 58090 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[22:51:21] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:55:10] <wikibugs>	 SRE, Traffic: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (Huji) Likely. But the point about an error message shown which appears to only exist in unit test code is also worth investigating.
[22:55:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29694 and previous config saved to /var/cache/conftool/dbconfig/20220613-225519-marostegui.json
[22:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:21] <wikibugs>	 SRE, Shellbox, serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (RLazarus)
[23:03:33] <wikibugs>	 SRE, Shellbox, serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (RLazarus) p:Triage→Medium
[23:09:01] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:10:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29695 and previous config saved to /var/cache/conftool/dbconfig/20220613-231024-marostegui.json
[23:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:37] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:02] <mutante>	 ^ that would be me because I rebooted that.. not expecting it though
[23:12:09] <icinga-wm>	 PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:12:55] <icinga-wm>	 PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:14:13] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:15] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:30] <mutante>	 !log gitlab-runner2001 - systemctl reset-failed to clear alert about failed ifup for ens14 which is actually up. race condition caused by reboot
[23:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:21] <wikibugs>	 (PS1) Bartosz Dziewoński: Make new topic tool available as opt-out almost everywhere (phase 4) [mediawiki-config] - https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392)
[23:19:03] <wikibugs>	 SRE, Shellbox, serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (Legoktm) Did we determine whether the most recent spike was legitimate user traffic or malicious/DoS?  The Abstract Wikipedia team has a proposal somewhere for rendering some fragments async, we could...
[23:21:19] <wikibugs>	 (PS11) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[23:21:31] <wikibugs>	 (PS10) Tim Starling: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - https://gerrit.wikimedia.org/r/801836
[23:21:59] <wikibugs>	 SRE, Shellbox, serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (Legoktm) Also, one of the Wikisources has some Lua magic that renders each score like 4 times because they're PNGs. I think if we switched to/enabled SVG rendering (T49578) we could cut that down to j...
[23:22:06] <wikibugs>	 (PS2) Bartosz Dziewoński: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - https://gerrit.wikimedia.org/r/804395 (owner: Esanders)
[23:22:35] <wikibugs>	 (CR) Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300" [mediawiki-config] - https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) (owner: Bartosz Dziewoński)
[23:22:38] <wikibugs>	 (CR) Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300" [mediawiki-config] - https://gerrit.wikimedia.org/r/804395 (owner: Esanders)
[23:23:49] <icinga-wm>	 RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:24:27] <wikibugs>	 (CR) Tim Starling: [C: +2] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[23:24:31] <icinga-wm>	 RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:25:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29696 and previous config saved to /var/cache/conftool/dbconfig/20220613-232529-marostegui.json
[23:25:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[23:25:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[23:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:36] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[23:25:38] <wikibugs>	 (Merged) jenkins-bot: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[23:25:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T310011)', diff saved to https://phabricator.wikimedia.org/P29697 and previous config saved to /var/cache/conftool/dbconfig/20220613-232537-marostegui.json
[23:25:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:35] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[23:29:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:30:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:54] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: T134809 g 799685 codfw master DBs (duration: 03m 30s)
[23:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:58] <stashbot>	 T134809: App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809
[23:31:04] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: Btullis)
[23:31:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:11] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/etcd.php: T134809 g 799685 codfw master DBs (duration: 03m 36s)
[23:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:32] <wikibugs>	 (CR) Tim Starling: [C: +2] Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - https://gerrit.wikimedia.org/r/801836 (owner: Tim Starling)
[23:39:03] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (KFrancis) -Confirming the NDA has been signed.  Please proceed with the access request.  Thanks!
[23:39:20] <wikibugs>	 (Merged) jenkins-bot: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - https://gerrit.wikimedia.org/r/801836 (owner: Tim Starling)
[23:45:26] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: T134809 g 801836 remove variable wmgDbconfigFromEtcd (duration: 03m 26s)
[23:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:32] <stashbot>	 T134809: App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809
[23:46:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:47:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:08] <wikibugs>	 (PS2) Legoktm: mediawiki: Disable useless mostlinkedcategories update job [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456)
[23:49:10] <wikibugs>	 (PS2) Legoktm: mediawiki: Remove absented mostlinkedcategories job [puppet] - https://gerrit.wikimedia.org/r/804804
[23:50:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T310011)', diff saved to https://phabricator.wikimedia.org/P29698 and previous config saved to /var/cache/conftool/dbconfig/20220613-235053-marostegui.json
[23:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:56] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[23:52:48] <wikibugs>	 (CR) Legoktm: [C: +2] mediawiki: Disable useless mostlinkedcategories update job [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm)
[23:56:36] <wikibugs>	 (CR) Legoktm: [C: +2] mediawiki: Remove absented mostlinkedcategories job [puppet] - https://gerrit.wikimedia.org/r/804804 (owner: Legoktm)
[23:57:38] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook