[00:00:04] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:04:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:10:35] (CR) Lucas Werkmeister (WMDE): [C: -1] Manage /etc/inputrc using Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819016 (https://phabricator.wikimedia.org/T293614) (owner: Lucas Werkmeister (WMDE))
[00:11:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:16:12] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:20:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:25:12] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-07-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:27:56] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:29:36] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:30:22] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:35:50] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:37:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:01:40] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-02 00:00:01 (3312 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:03:24] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[01:16:00] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:16:21] SRE, ops-codfw, Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[01:22:16] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[01:35:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:39:16] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:45] (JobUnavailable) firing: (10) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:10] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:50:45] (JobUnavailable) firing: (10) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:32] PROBLEM - Disk space on gitlab2002 is CRITICAL: DISK CRITICAL - free space: /srv/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab2002&var-datasource=codfw+prometheus/ops
[02:04:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:07] (PS1) Stang: tkwiki: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/819774 (https://phabricator.wikimedia.org/T314435)
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-Y on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-Y 189 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-X on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-X 273 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-Y on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-Y 161 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:56] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-X on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-X 295 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:08] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-Z on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-Z 311 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:44] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-02 00:00:02 (3312 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:09:50] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-Z on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-Z 321 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:38] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[02:18:46] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:20:36] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:22:48] (Device rebooted) firing: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:27:48] (Device rebooted) resolved: Device ps1-b4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:30:02] Puppet, Infrastructure-Foundations, Patch-For-Review, Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (ori) `enable-bracketed-paste` is on by default starting with Bash 5.1, which is the version in bullseye....
[02:32:56] (Device rebooted) firing: Alert for device ps1-b5-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:33:28] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:34:54] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-02 00:00:01 (3333 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:37:56] (Device rebooted) resolved: Device ps1-b5-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:47:09] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[02:49:45] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[03:01:50] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-02 00:00:02 (3333 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:06:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:34:22] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[03:36:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:37:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:06] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[03:46:24] (CR) Andrea Denisse: librenms: Remove support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[03:50:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:57:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:04:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:11:04] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:14:12] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[04:20:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:36:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:50:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:55:36] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:00:16] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:00:32] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:05:42] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:42] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:09:58] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:19:56] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[05:21:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:30:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:32:14] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:32:56] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:33:24] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:33:48] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:34:14] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:48] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:36:26] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[05:38:09] re-paged from 24 hours ago, resolving in VO, no action needed
[05:38:13] yeah
[05:38:20] We must have forgotten to resolve it yesterday
[05:38:27] They can all be resolved
[05:38:39] I thought they would resolve automatically yesterday once the process got back up
[05:39:03] ✅
[05:39:29] thanks :*
[05:40:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:40:45] (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:41:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:41:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32187 and previous config saved to /var/cache/conftool/dbconfig/20220803-054106-marostegui.json
[05:41:09] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[05:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32188 and previous config saved to /var/cache/conftool/dbconfig/20220803-054526-marostegui.json
[06:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32189 and previous config saved to /var/cache/conftool/dbconfig/20220803-060032-marostegui.json
[06:04:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32190 and previous config saved to /var/cache/conftool/dbconfig/20220803-061538-marostegui.json
[06:17:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:36] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:23:29] (PS2) KartikMistry: CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208)
[06:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32191 and previous config saved to /var/cache/conftool/dbconfig/20220803-063045-marostegui.json
[06:30:49] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:30:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[06:31:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[06:31:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
[06:31:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
[06:31:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:31:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32192 and previous config saved to /var/cache/conftool/dbconfig/20220803-063148-marostegui.json
[06:35:58] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:36:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32193 and previous config saved to /var/cache/conftool/dbconfig/20220803-063656-marostegui.json
[06:37:00] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:37:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[06:38:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:38:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[06:39:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:39:24] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:59] !log power up centrallog2002 and prometheus2005 - T310070
[06:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:04] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[06:47:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:47:56] RECOVERY - Host prometheus2005 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms
[06:48:24] PROBLEM - puppet last run on prometheus2005 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:48:30] RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms
[06:49:28] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:56] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:50:06] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:06] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:45] (JobUnavailable) firing: (5) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32194 and previous config saved to /var/cache/conftool/dbconfig/20220803-065202-marostegui.json
[06:52:14] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:27] SRE, Traffic: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (fgiunchedi)
[06:54:25] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:54:42] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[06:54:44] RECOVERY - puppet last run on prometheus2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:55:47] thanos rule is me, prometheus coming back
[06:56:46] !log grow sda/sdb 3 by 100G on thanos-be1002 - T314275
[06:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:49] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[06:56:55] !log grow sda/sdb 3 by 100G on thanos-be2003 - T314275
[06:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:29] SRE, Traffic: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (ayounsi) p:Triage→Low Agreed something needs to be fixed. The upside is that it works as a safeguard, preventing the service to receive live traffic before the fi...
[06:59:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:59:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T0700).
[07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] * kart_ is here and will self deploy..
[07:00:27] !log draining ganeti2011 T311686
[07:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:30] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[07:00:54] SRE-swift-storage: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (fgiunchedi)
[07:01:03] (CR) KartikMistry: [C: +2] CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208) (owner: KartikMistry)
[07:01:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[07:02:13] (Merged) jenkins-bot: CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208) (owner: KartikMistry)
[07:02:58] (CR) Giuseppe Lavagetto: [V: +1 C: +2] wancache: temporarily remove mc-gp2002 from the gutter pool [puppet] - https://gerrit.wikimedia.org/r/819634 (owner: Giuseppe Lavagetto)
[07:04:58] (CR) Giuseppe Lavagetto: [C: +2] mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) (owner: RLazarus)
[07:05:09] (PS2) Giuseppe Lavagetto: mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) (owner: RLazarus)
[07:05:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch
instance to plain disks, T311686 [07:05:35] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [07:05:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686 [07:06:42] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36601/console" [puppet] - 10https://gerrit.wikimedia.org/r/819697 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [07:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32195 and previous config saved to /var/cache/conftool/dbconfig/20220803-070708-marostegui.json [07:07:44] * kart_ Deploying after testing.. [07:09:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:09:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:18] (03PS1) 10Marostegui: Revert "mariadb: Disable notifications rack b5" [puppet] - 10https://gerrit.wikimedia.org/r/819791 [07:11:10] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819227|CX: Set MT threshold for publishing in Armenian WP to 80% (T313208)]] (duration: 03m 49s) [07:11:13] T313208: Adjust the threshold for Armenian to prevent publishing when overall unmodified content is higher than 80% - https://phabricator.wikimedia.org/T313208 [07:12:57] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Disable 
notifications rack b5" [puppet] - 10https://gerrit.wikimedia.org/r/819791 (owner: 10Marostegui) [07:16:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance [07:16:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:16:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance [07:16:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:16:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance [07:17:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance [07:17:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance [07:17:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance [07:17:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:18:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance [07:18:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance [07:19:09] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:19:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:22:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32196 and previous config saved to /var/cache/conftool/dbconfig/20220803-072214-marostegui.json [07:22:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:22:18] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [07:22:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:22:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:22:36] (03PS1) 10Giuseppe Lavagetto: redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 [07:22:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32197 and previous config saved to /var/cache/conftool/dbconfig/20220803-072253-marostegui.json [07:23:07] 
(03PS1) 10Marostegui: mariadb: Disable notifications on codfw racks [puppet] - 10https://gerrit.wikimedia.org/r/820066 (https://phabricator.wikimedia.org/T310070) [07:23:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:23:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on codfw racks [puppet] - 10https://gerrit.wikimedia.org/r/820066 (https://phabricator.wikimedia.org/T310070) (owner: 10Marostegui) [07:26:47] (03PS2) 10Giuseppe Lavagetto: redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 [07:28:00] (03PS1) 10Marostegui: mariadb: Disable notifications pdu C rows [puppet] - 10https://gerrit.wikimedia.org/r/820067 (https://phabricator.wikimedia.org/T310145) [07:28:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36603/console" [puppet] - 10https://gerrit.wikimedia.org/r/820065 (owner: 10Giuseppe Lavagetto) [07:29:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications pdu C rows [puppet] - 10https://gerrit.wikimedia.org/r/820067 (https://phabricator.wikimedia.org/T310145) (owner: 10Marostegui) [07:30:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 (owner: 10Giuseppe Lavagetto) [07:33:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [07:36:32] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:37:12] ^ expected [07:41:29] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Marostegui) Databases in c1 and c2 are ready [07:41:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Marostegui) Databases in the remaining B* racks are ready [07:44:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:41] (03PS1) 10Jcrespo: [WIP]Update section script to use a stable API rather than the db [software] - 10https://gerrit.wikimedia.org/r/820069 [07:46:25] (03PS1) 10Marostegui: instances.yaml: Remove db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820070 (https://phabricator.wikimedia.org/T313911) [07:47:24] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820070 (https://phabricator.wikimedia.org/T313911) (owner: 10Marostegui) [07:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2072 from dbctl T313911', diff saved to https://phabricator.wikimedia.org/P32199 and previous config saved to /var/cache/conftool/dbconfig/20220803-074806-marostegui.json [07:48:10] T313911: decommission db2072 - https://phabricator.wikimedia.org/T313911 [07:49:21] (03PS1) 10Marostegui: mariadb: Decommission db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820071 (https://phabricator.wikimedia.org/T313911) [07:49:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2072.codfw.wmnet [07:50:45] (03PS1) 10Filippo Giunchedi: o11y: alert on Icinga max check 
latency [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) [07:51:35] (03PS1) 10Jcrespo: Add the possibility of searching racks for instances, too [software/pampinus] - 10https://gerrit.wikimedia.org/r/820073 (https://phabricator.wikimedia.org/T283017) [07:53:54] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:54:01] (03CR) 10Elukey: "Thanks Daniel! <3" [puppet] - 10https://gerrit.wikimedia.org/r/819649 (https://phabricator.wikimedia.org/T230178) (owner: 10Dzahn) [07:54:16] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:59:06] (03PS1) 10Jcrespo: [WIP]Add instance script with increased functionality over section [software] - 10https://gerrit.wikimedia.org/r/820074 (https://phabricator.wikimedia.org/T283017) [07:59:35] (03Abandoned) 10Jcrespo: [WIP]Update section script to use a stable API rather than the db [software] - 10https://gerrit.wikimedia.org/r/820069 (owner: 10Jcrespo) [08:04:04] 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Aklapper) [08:14:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820071 (https://phabricator.wikimedia.org/T313911) (owner: 10Marostegui) [08:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2072.codfw.wmnet [08:15:40] 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Marostegui) a:03Papaul [08:15:53] (03CR) 
10Elukey: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:15:57] 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Marostegui) @Papaul this is ready for you [08:17:15] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw [08:18:13] (03PS3) 10MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) [08:19:07] !log stop db2098 for T310070 [08:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:10] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [08:23:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32200 and previous config saved to /var/cache/conftool/dbconfig/20220803-082318-marostegui.json [08:23:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:26:14] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [08:28:10] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:35:46] (03PS1) 10KartikMistry: Update cxserver to 2022-08-03-082610-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T308248) [08:36:06] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:38:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32201 and previous config saved to /var/cache/conftool/dbconfig/20220803-083824-marostegui.json [08:40:03] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Krinkle) >>! In T138093#8117992, @ori wrote: > Rolling this out to the high-traffic wikis will be a little bit tricky. When we turn it o... [08:41:31] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:42:08] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Krinkle) >>! In T138093#8012474, @ori wrote: > Re-ordering duplicate query parameters could be problematic. […] This means that `?action... 
[08:46:04] PROBLEM - Check systemd state on db2107 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:29] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:49:02] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:51:46] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [08:53:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:24] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32202 and previous config saved to /var/cache/conftool/dbconfig/20220803-085330-marostegui.json [08:54:04] PROBLEM - Check systemd state on db2109 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:06] PROBLEM - Check systemd state on db2159 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:22] !log put the esams-drmrs link in service - T307221 [08:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:56:33] (03PS1) 10Jcrespo: Setup temporary 
arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 [08:57:27] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [08:57:55] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [08:58:18] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [08:58:41] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet [08:58:51] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [08:59:15] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet [08:59:38] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet [08:59:55] (03CR) 10CI reject: [V: 04-1] Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 (owner: 10Jcrespo) [09:00:21] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [09:00:26] RECOVERY - Check systemd state on db2159 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:33] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet [09:01:18] PROBLEM - MariaDB Replica Lag: s7 on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7790.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:01:45] (03PS1) 10Marostegui: dbproxy2002: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820082 [09:02:04] RECOVERY - Check systemd state on db2107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:42] RECOVERY - Check systemd state on db2109 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:39] (03PS2) 10Jcrespo: Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 [09:04:19] 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Aklapper) Superseded by {T310738}? [09:04:35] !log stop backup2006 backup2009 for T310070 [09:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:40] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [09:05:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [09:06:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [09:06:42] (03PS1) 10Giuseppe Lavagetto: Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 [09:06:51] (03PS2) 10Giuseppe Lavagetto: Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 [09:07:03] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 (owner: 10Giuseppe Lavagetto) [09:07:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet [09:08:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32203 and previous config saved to /var/cache/conftool/dbconfig/20220803-090836-marostegui.json [09:08:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance 
[09:08:39] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:08:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:08:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:08:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet [09:09:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32204 and previous config saved to /var/cache/conftool/dbconfig/20220803-090912-marostegui.json [09:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32205 and previous config saved to /var/cache/conftool/dbconfig/20220803-091019-marostegui.json [09:10:56] !log configure BGP on the esams-drmrs link - T307221 [09:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:51] (03PS4) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 [09:14:41] (03CR) 10Marostegui: [C: 03+2] dbproxy2002: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820082 (owner: 10Marostegui) [09:15:23] 10SRE-swift-storage, 10Commons: New broken files (premature end of file) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T284188 (10Aklapper) 05Open→03Declined Unfortunately closing this Phabricator task as no further information has been provided. If this still happens, please... 
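The depool/repool cycle that recurs above (db1178, db1167, db1101:3318) follows one fixed shape. A minimal dry-run sketch of that sequence, assuming dbctl's `instance` and `config commit` subcommands as documented on Wikitech; nothing is executed here, the commands are only printed for review, and the instance/task names are examples taken from the 09:09 entries:

```shell
# Dry-run: print the depool -> maintain -> repool sequence seen in the log.
# Assumes dbctl's "instance"/"config commit" subcommands (per Wikitech docs);
# commands are only echoed, never executed.
INSTANCE="db1101:3318"   # example instance from the 09:09 entries
TASK="T312972"
for cmd in \
    "dbctl instance ${INSTANCE} depool" \
    "dbctl config commit -m 'Depooling ${INSTANCE} (${TASK})'" \
    "dbctl instance ${INSTANCE} pool -p 100" \
    "dbctl config commit -m 'Repooling after maintenance ${INSTANCE} (${TASK})'"
do
  echo "$cmd"
done
```

The repeated "Repooling after maintenance" commits in the log (07:07, 08:23, 08:38, 09:10, ...) reflect gradual repooling: the pool percentage is raised in steps rather than jumping straight back to 100.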
[09:15:46] !log power on mc2024 [09:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:54] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [09:17:27] (03PS3) 10Btullis: Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) [09:18:00] RECOVERY - Host mc2024 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [09:19:39] (03PS1) 10Marostegui: instances.yaml: Remove db2090 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820085 (https://phabricator.wikimedia.org/T314109) [09:20:23] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [09:20:33] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2090 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820085 (https://phabricator.wikimedia.org/T314109) (owner: 10Marostegui) [09:20:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2090 from dbctl T314109', diff saved to https://phabricator.wikimedia.org/P32206 and previous config saved to /var/cache/conftool/dbconfig/20220803-092053-marostegui.json [09:20:56] T314109: decommission db2090 - https://phabricator.wikimedia.org/T314109 [09:21:36] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [09:21:41] (03PS1) 10Vgutierrez: lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) [09:22:28] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:22:28] !log oblivian@cumin1001 END 
(FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [09:23:17] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:23:19] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:05] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:24:05] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:13] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:24:15] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:32] RECOVERY - MariaDB Replica Lag: s7 on db2159 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:25:20] (03PS1) 10Marostegui: site.pp: Decommission db2090 [puppet] - 10https://gerrit.wikimedia.org/r/820087 (https://phabricator.wikimedia.org/T314109) [09:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32207 and previous config saved to /var/cache/conftool/dbconfig/20220803-092525-marostegui.json [09:25:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2090.codfw.wmnet [09:28:03] (03PS1) 10Ayounsi: add include for 2620:0:862:fe08::/64 PTR [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) [09:29:17] (03CR) 10CI reject: [V: 04-1] add include for 2620:0:862:fe08::/64 PTR [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [09:29:50] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [09:31:13] (03CR) 10Ayounsi: "The -1 is expected as the included file is created by the `sre.netbox.dns` cookbook." 
[dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[09:31:25] (03PS1) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129)
[09:31:34] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1042_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[09:32:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:32:37] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[09:33:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:33:52] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@674bb8b]: (no justification provided)
[09:33:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:33:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2090.codfw.wmnet
[09:33:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) (owner: 10Vgutierrez)
[09:34:01] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@674bb8b]: (no justification provided) (duration: 00m 10s)
[09:34:58] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:35:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS bullseye
[09:35:07] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye
[09:35:19] (03CR) 10Btullis: [C: 03+2] Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32208 and previous config saved to /var/cache/conftool/dbconfig/20220803-094032-marostegui.json
[09:41:50] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:42:07] !log kubectl cordon kubestage2002
[09:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:25] (03CR) 10Vgutierrez: [C: 03+2] lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) (owner: 10Vgutierrez)
[09:43:47] !log rolling restart of pybal in codfw lvs instances - T310070
[09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:50] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[09:43:50] !log kubectl drain --ignore-daemonsets kubestage2002.codfw.wmnet
[09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:16] (03PS13) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[09:44:36] (03PS1) 10Giuseppe Lavagetto: Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092
[09:44:52] (03CR) 10Marostegui: [C: 03+2] site.pp: Decommission db2090 [puppet] - 10https://gerrit.wikimedia.org/r/820087 (https://phabricator.wikimedia.org/T314109) (owner: 10Marostegui)
[09:45:43] (03CR) 10CI reject: [V: 04-1] Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092 (owner: 10Giuseppe Lavagetto)
[09:46:06] !log kubectl cordon kubernetes2020.codfw.wmnet kubernetes2009.codfw.wmnet kubernetes2010.codfw.wmnet kubernetes2011.codfw.wmnet kubernetes2012.codfw.wmnet
[09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:08] 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Marostegui) @papaul this is ready for you
[09:46:12] 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Marostegui) a:03Papaul
[09:47:49] !log kubectl drain --ignore-daemonsets kubernetes2020.codfw.wmnet
[09:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:38] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[09:48:46] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
[09:49:19] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
[09:49:37] (03CR) 10Btullis: "I have made a related wikitech edit about this change:" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:50:26] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2009.codfw.wmnet
[09:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:55] (03CR) 10Jbond: [C: 03+2] reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 (owner: 10Jbond)
[09:52:52] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2010.codfw.wmnet
[09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
[09:54:48] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route
[09:54:48] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[09:54:49] !Log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
[09:54:56] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
[09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:32] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=restbase2027.codfw.wmnet
[09:55:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32209 and previous config saved to /var/cache/conftool/dbconfig/20220803-095538-marostegui.json
[09:55:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:55:41] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[09:55:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:56:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32210 and previous config saved to /var/cache/conftool/dbconfig/20220803-095559-marostegui.json
[09:56:26] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2012.codfw.wmnet
[09:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:41] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet
[09:56:42] (03CR) 10MMandere: [C: 03+1] "LGTM... The new DC order is correct as per measurements recorded in the aforementioned excel sheet." [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall)
[09:56:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
[09:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32211 and previous config saved to /var/cache/conftool/dbconfig/20220803-095706-marostegui.json
[09:58:00] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10fgiunchedi)
[09:58:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation={create,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:00:33] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi)
[10:01:14] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:04:30] 10SRE-swift-storage, 10User-fgiunchedi: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (10fgiunchedi)
[10:04:52] (03CR) 10Jbond: "look good but a few nits" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:12:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32212 and previous config saved to /var/cache/conftool/dbconfig/20220803-101212-marostegui.json
[10:14:28] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:14:31] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:14:35] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:14:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS bullseye
[10:14:39] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:14:41] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye completed: - ganeti2011 (**PASS**) - Downtimed on...
[10:20:19] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubestage2002.codfw.wmnet
[10:21:35] (03CR) 10Jcrespo: [C: 03+2] Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 (owner: 10Jcrespo)
[10:22:19] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2020.codfw.wmnet
[10:22:32] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2009.codfw.wmnet
[10:22:43] (03PS1) 10Jcrespo: Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794
[10:22:48] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet
[10:23:00] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2011.codfw.wmnet
[10:23:09] (03CR) 10Jcrespo: [C: 04-1] "Wait for PDU maintenance to complete to revert." [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[10:23:15] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2012.codfw.wmnet
[10:26:36] (03CR) 10Filippo Giunchedi: [C: 03+1] Add VictorOps CLI tool & escalate_unpaged command (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[10:27:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32213 and previous config saved to /var/cache/conftool/dbconfig/20220803-102718-marostegui.json
[10:27:48] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:27:51] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:27:56] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:27:59] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:28:12] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:29:52] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:29:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:30:00] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:30:03] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:30:41] (03PS1) 10Marostegui: mariadb: Productionize db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820098 (https://phabricator.wikimedia.org/T311494)
[10:32:37] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820098 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[10:37:38] !log shutdown kubestage2002 kubernetes2020 kubernetes2009 kubernetes2010 kubernetes2011 kubernetes2012
[10:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
[10:38:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
[10:38:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2022.codfw.wmnet
[10:40:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase201[45].codfw.wmnet
[10:41:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[10:42:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32215 and previous config saved to /var/cache/conftool/dbconfig/20220803-104224-marostegui.json
[10:42:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[10:42:28] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[10:42:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[10:42:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32216 and previous config saved to /var/cache/conftool/dbconfig/20220803-104246-marostegui.json
[10:44:19] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[10:44:58] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:45:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:46:47] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
[10:47:01] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
[10:47:41] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[10:48:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:48:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:58] (KubernetesCalicoDown) firing: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:50:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[10:53:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2011.codfw.wmnet to cluster codfw and group C
[10:53:59] (03PS1) 10Filippo Giunchedi: WIP klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603)
[10:54:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2011.codfw.wmnet to cluster codfw and group C
[10:54:36] (03CR) 10Filippo Giunchedi: "An outline to run escalate_unpaged periodically, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi)
[10:54:41] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:58] (KubernetesCalicoDown) firing: (3) kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:56:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "This can probably be abandoned per ori in T293614#8126628 (leaving it open for additional feedback for a few days)." [puppet] - 10https://gerrit.wikimedia.org/r/819016 (https://phabricator.wikimedia.org/T293614) (owner: 10Lucas Werkmeister (WMDE))
[10:56:36] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (10Lucas_Werkmeister_WMDE) That’s great, thanks! In that case I’m happy to close this task and live with my...
[10:59:44] (03PS1) 10Marostegui: install_server: Do not reimage db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820101 (https://phabricator.wikimedia.org/T311494)
[11:00:46] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820101 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[11:02:21] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi)
[11:04:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi)
[11:08:26] (03PS1) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:09:47] (03CR) 10CI reject: [V: 04-1] site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:09:58] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:12:32] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[11:12:52] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:13:12] (03CR) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[11:13:14] (03PS2) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:13:21] 10SRE, 10Observability-Logging, 10vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (10MoritzMuehlenhoff) >>! In T313408#8124432, @MoritzMuehlenhoff wrote: > Looks good, could you use Row C but wait two days? I'm currently reimaging codfw ganeti node...
[11:13:27] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[11:14:53] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[11:14:55] (03CR) 10CDanis: [C: 03+1] WIP klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi)
[11:15:35] (03PS1) 10Jbond: peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[11:16:44] (03CR) 10Jbond: PeeringDB API: initial commit (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:17:41] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:41] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:43] <_joe_> !log depooling codfw services from all traffic
[11:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:57] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:59] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:17] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:35] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:45] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[11:18:46] (03CR) 10Jbond: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:18:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:21:03] (03PS1) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651)
[11:21:32] (03PS2) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651)
[11:21:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async
[11:22:10] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-backend
[11:22:47] (03CR) 10CI reject: [V: 04-1] peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103 (owner: 10Jbond)
[11:22:54] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=kartotherian
[11:23:26] (03CR) 10Marostegui: "It looks good, let's run a PCC just in case." [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:24:01] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[11:24:22] 10SRE: Allow Wikimedia Maps usage on a private project for an university. - https://phabricator.wikimedia.org/T306467 (10Aklapper) 05Stalled→03Declined Unfortunately closing this Phabricator task as no further information has been provided. @HermidaVazquez: After you have provided the information asked for...
[11:24:23] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[11:24:35] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:25:47] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:25:47] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:25:47] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:26:09] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wdqs
[11:26:26] (03CR) 10Jbond: Add an option to use the PKI for etcd intra-cluster certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:29:20] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10hnowlan)
[11:31:28] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:31:28] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:31:28] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:32:19] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[11:35:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:37:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
[11:38:10] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[11:38:10] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase2022.codfw.wmnet
[11:41:29] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=(kubernetes2020.codfw.wmnet|kubernetes2009.codfw.wmnet|kubernetes2010.codfw.wmnet|kubernetes2011.codfw.wmnet|kubernetes2012.codfw.wmnet|kubestage2002.codfw.wmnet)
[11:42:17] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[11:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32217 and previous config saved to /var/cache/conftool/dbconfig/20220803-114301-marostegui.json
[11:43:04] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[11:45:14] (03PS3) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:45:19] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:46:36] !log jayme@cumin1001 conftool action : set/weight=10; selector: name=(kubernetes2019.codfw.wmnet|kubernetes2021.codfw.wmnet|kubernetes2022.codfw.wmnet|kubernetes2018.codfw.wmnet|kubernetes2020.codfw.wmnet)
[11:47:35] (03PS14) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[11:49:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:16] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:24] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:26] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:30] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145
[11:49:34] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[11:49:35] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:36] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:38] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:38] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:42] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145
[11:50:28] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:30] (03PS15) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[11:50:48] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:51:40] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:54:16] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:54:37] (03PS1) 10Marostegui: instances.yaml: Add db2176 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820110 (https://phabricator.wikimedia.org/T311494)
[11:54:50] (03PS2) 10Giuseppe Lavagetto: Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092
[11:55:55] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2176 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820110 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[11:56:18] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:56:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092 (owner: 10Giuseppe Lavagetto)
[11:57:01] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2176 to s1 T311494', diff saved to https://phabricator.wikimedia.org/P32218 and previous config saved to /var/cache/conftool/dbconfig/20220803-115706-marostegui.json
[11:57:10] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[11:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32219 and previous config saved to /var/cache/conftool/dbconfig/20220803-115807-marostegui.json
[11:59:07] (03PS1) 10Btullis: cfssl [puppet] - 10https://gerrit.wikimedia.org/r/820111
[11:59:40] (03PS1) 10Marostegui: db2176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820112 (https://phabricator.wikimedia.org/T311494)
[12:00:30] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:30] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[12:00:38] (03CR) 10Marostegui: [C: 03+2] db2176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820112 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[12:01:33] (03Abandoned) 10Btullis: cfssl [puppet] - 10https://gerrit.wikimedia.org/r/820111 (owner: 10Btullis)
[12:02:28] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10JMeybohm)
[12:03:05] (03PS1) 10Jbond: puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136)
[12:04:12] (03PS2) 10Jbond: puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136)
[12:04:56] (03CR) 10Btullis: "My latest patchset didn't get updated properly. Will fix shortly." [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[12:05:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36606/console" [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond)
[12:07:18] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: regen-zoom-level-tilerator-regen.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:03] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10JMeybohm)
[12:12:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32220 and previous config saved to /var/cache/conftool/dbconfig/20220803-121313-marostegui.json [12:14:45] 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) [12:16:34] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@614f7b2]: (no justification provided) [12:16:45] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@614f7b2]: (no justification provided) (duration: 00m 11s) [12:17:01] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:19:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10EChetty) [12:19:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10EChetty) [12:19:33] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:19:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10EChetty) [12:21:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:21:52] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:23:01] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:23:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10jnuche) [12:24:25] jayme: sorry we had conflicting edits on T310145 :( [12:24:26] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [12:24:50] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10MoritzMuehlenhoff) [12:25:44] godog: but I was first :D [12:26:27] jayme: haha! were you though? I see your edit obliterated some of my edits too [12:26:38] anyways I'll fix it [12:26:46] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:26:56] ah you did, nicely done jayme [12:26:59] godog: oops, okay [12:27:08] hope your stuff is still in there now godog [12:27:17] it is! 
👍 [12:27:25] (03PS1) 10Giuseppe Lavagetto: Reducing codfw mobileapps during maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 [12:27:33] cool [12:27:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:27:57] just another edit to be sure it is persistent :-p [12:28:11] lolz [12:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32221 and previous config saved to /var/cache/conftool/dbconfig/20220803-122819-marostegui.json [12:28:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:28:25] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:28:28] (03PS1) 10Hnowlan: jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) [12:28:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:28:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:29:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:29:24] (03CR) 10JMeybohm: [C: 03+1] Reducing codfw mobileapps during maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 (owner: 10Giuseppe Lavagetto) [12:29:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32222 and previous config saved to /var/cache/conftool/dbconfig/20220803-122929-marostegui.json [12:30:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Reducing codfw mobileapps during 
maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 (owner: 10Giuseppe Lavagetto) [12:31:19] (03CR) 10Ladsgroup: "Ran it on anything that had db in it:" [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup) [12:31:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi) [12:33:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32223 and previous config saved to /var/cache/conftool/dbconfig/20220803-123336-marostegui.json [12:33:40] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:35:03] (03PS3) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) [12:36:36] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:36:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi) [12:36:43] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:39:33] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) [12:40:03] !log uploaded openjdk-8 8u342-b07-1~deb10u1 to component/jdk8 for buster-wikimedia (rebuild of latest Java 8 security update) [12:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) [12:41:15] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) [12:48:28] PROBLEM - etcd request latencies on 
kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:48:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32224 and previous config saved to /var/cache/conftool/dbconfig/20220803-124842-marostegui.json [12:48:58] (03PS2) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) [12:49:50] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:51:44] (03Abandoned) 10Ssingh: hiera: replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819706 (https://phabricator.wikimedia.org/T293012) (owner: 10Ssingh) [12:53:24] (03PS1) 10Ssingh: Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 [12:56:29] (03PS1) 10Marostegui: mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) [12:56:52] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [12:58:20] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:59:10] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: T310070 [12:59:13] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [12:59:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
elastic2043.codfw.wmnet with reason: T310070 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1300). [13:00:05] hauskatze: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] o/ [13:00:41] (03PS1) 10Vgutierrez: Backport several fixex scheduled for 9.1.3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) [13:01:01] (03PS2) 10Vgutierrez: Backport several fixes scheduled for 9.1.3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) [13:01:03] I can deploy today [13:01:05] hi hauskatze [13:01:10] (03CR) 10Urbanecm: [C: 03+2] Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio) [13:03:09] (03PS1) 10Btullis: Add DHCP details for an-airflow1004 [puppet] - 10https://gerrit.wikimedia.org/r/820122 (https://phabricator.wikimedia.org/T312858) [13:03:45] (03Merged) 10jenkins-bot: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio) [13:03:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32226 and previous config saved to /var/cache/conftool/dbconfig/20220803-130348-marostegui.json [13:04:01] o/ [13:04:22] hauskatze: pulled to mwdebug1001, please have a look [13:04:26] ack [13:04:33] (03CR) 10Btullis: [C: 03+2] Add DHCP details for an-airflow1004 [puppet] - 10https://gerrit.wikimedia.org/r/820122 
(https://phabricator.wikimedia.org/T312858) (owner: 10Btullis) [13:04:50] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:04] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070 [13:05:08] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [13:05:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070 [13:05:56] (03PS2) 10Marostegui: mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) [13:06:31] urbanecm: checked - form appears okay with the fields as requested [13:06:38] great, syncing [13:06:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:03] I'll ask tzatziki to send a test email once synced just in case, so they can verify everything works [13:07:10] sounds good [13:07:15] (03PS1) 10Papaul: Add new pdu model for pdu in rack b3 b6-b8 and c1 [puppet] - 10https://gerrit.wikimedia.org/r/820123 (https://phabricator.wikimedia.org/T310070) [13:07:26] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:07:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [13:07:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:07:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:07:57] (03PS2) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 
(https://phabricator.wikimedia.org/T313603) [13:08:14] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [13:08:31] (03CR) 10Ayounsi: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [13:09:32] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu [13:09:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu [13:10:42] scap says `ssh: connect to host mw2259.codfw.wmnet port 22: Connection timed out` [13:11:14] urbanecm: very likely expected, a bunch of codfw machines are being powered off for maintenance on the power distribution equipment [13:11:29] yep, just mentioning as a good practice [13:11:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:36] 👍 [13:11:44] (03CR) 10Ssingh: [C: 03+1] "LGTM based on my (limited) understanding of this; but I did compare it against similar code and looks good." [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:12:23] I guess the codfw machines can be scap pulled afterwards? [13:12:33] (or perhaps that’s even part of the power-on procedure? idk) [13:12:37] (03PS1) 10Marostegui: site.pp: Remove insetup role from db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820125 (https://phabricator.wikimedia.org/T311494) [13:12:41] !log introduce puppetmaster[12]004 for now as offline [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:54] i find it interesting that mw2259 is apparently still a part of the jobrunner cluster? 
https://config-master.wikimedia.org/pybal/codfw/jobrunner [13:13:37] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820125 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [13:13:49] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36607/console" [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:13:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [13:13:56] and `13:12:29 39 apaches had sync errors` [13:14:20] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:14:28] Lucas_WMDE: they are set pooled=inactive in pybal so they won't begin serving traffic automatically upon bootup, and yeah, serviceops will do `scap pull` afterwards, that is standard procedure for returning a machine to service [13:14:35] (03PS4) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [13:14:37] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 (owner: 10Ssingh) [13:14:39] cdanis: good to know, thanks! 
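[Editor's note] The pool-state filtering described above — hosts set pooled=inactive in conftool are skipped by deploy tooling and repooled with `scap pull` later — can be sketched as follows. The record format and the eqiad hostname are purely illustrative, not the real conftool schema:

```python
# Minimal sketch of filtering sync targets by pool state, assuming a
# hypothetical record format (NOT the real conftool/etcd schema): hosts
# marked pooled=inactive should be skipped by deploy tooling such as scap.

def sync_targets(records):
    """Hostnames that should still receive syncs: anything not 'inactive'."""
    return [r["name"] for r in records if r["pooled"] != "inactive"]

hosts = [
    {"name": "mw2259.codfw.wmnet", "pooled": "inactive"},  # powered off for PDU work
    {"name": "mw1414.eqiad.wmnet", "pooled": "yes"},       # placeholder hostname
]
print(sync_targets(hosts))  # the inactive codfw host is dropped
```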
[13:15:19] (03PS2) 10Ssingh: Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 [13:16:27] !log urbanecm@deploy1002 Synchronized wmf-config/MetaContactPages.php: f89f02e306a1fa580fa41ba56de978f4208ea672: Amend license request contact form per Legal (T303359) (duration: 09m 27s) [13:16:31] T303359: Remove items from Meta-Wiki page [[Special:Contact/requestlicense]] - https://phabricator.wikimedia.org/T303359 [13:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:57] \o/ [13:16:58] cdanis: if i'm not misunderstanding https://config-master.wikimedia.org/pybal/codfw/jobrunner, mw2259 is still pooled (and unreachable). it's definitely not pooled=inactive, because scap should ignore inactive hosts. [13:17:13] oh, the jobrunners, hm [13:17:18] _joe_: ^ [13:18:03] !log depool codfw for PDU upgrade: CR 819798 [13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32227 and previous config saved to /var/cache/conftool/dbconfig/20220803-131855-marostegui.json [13:18:56] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7419 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:18:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:19:00] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:19:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:19:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T312972)', diff saved to 
https://phabricator.wikimedia.org/P32228 and previous config saved to /var/cache/conftool/dbconfig/20220803-131916-marostegui.json [13:19:26] urbanecm: well, I can at least verify that mw2259 is in an affected rack (B3) [13:19:37] (03PS1) 10Btullis: Configure an-airflow1004 to install with buster [puppet] - 10https://gerrit.wikimedia.org/r/820126 (https://phabricator.wikimedia.org/T312858) [13:20:52] thanks cdanis [13:21:10] fwiw there is also an apiserver (mw2317) that's also pooled & unreachable. didn't check the rest of the hosts though. [13:21:14] (03PS4) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) [13:21:18] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:21:45] ok, that's also in the affected set of hosts [13:21:50] I'll make sure those get marked as depooled soon [13:21:54] thanks! 
[13:22:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36608/console" [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:22:32] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [13:23:21] urbanecm: actually j.ayme from serviceops will check that rn :) thanks for noting [13:23:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:23] (03PS16) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [13:24:53] (03CR) 10Filippo Giunchedi: [C: 03+1] Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32229 and previous config saved to /var/cache/conftool/dbconfig/20220803-132524-marostegui.json [13:25:27] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:25:40] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7195 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:26:42] (03CR) 10Ssingh: [C: 03+1] "The new addition looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:27:31] (03CR) 10Btullis: [C: 03+2] Configure an-airflow1004 to install with buster [puppet] - 10https://gerrit.wikimedia.org/r/820126 (https://phabricator.wikimedia.org/T312858) (owner: 10Btullis) [13:28:39] (03PS3) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) [13:28:49] (03CR) 10CDanis: [C: 03+2] Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:30:43] !log installing Java 8 security updates for Buster [13:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:04] (03Merged) 10jenkins-bot: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:31:50] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:32:05] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:32:18] (03PS1) 10Jbond: readme: add w-sre to channels [puppet] - 10https://gerrit.wikimedia.org/r/820128 [13:32:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] readme: add w-sre to channels [puppet] - 10https://gerrit.wikimedia.org/r/820128 (owner: 10Jbond) [13:35:50] (03PS1) 10Muehlenhoff: Add Cumin aliases for ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/820129 [13:37:16] (03PS7) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) [13:37:53] cdanis: 
urbanecm: looks like the data on config-master is "old" confctl shows the nodes as pooled=inactive [13:38:01] ahhhhh [13:38:08] enlighten me :-) [13:38:11] hm. [13:38:15] interesting [13:38:31] scap trying to talk to them still is really interesting [13:38:37] thanks for checking jayme. then scap probably also uses the old confctl state? [13:39:10] is the thing that generates dsh files for scap -- and config-master state -- perhaps *not* redirected to read from eqiad instead of codfw? [13:39:54] (03PS2) 10Filippo Giunchedi: klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) [13:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32230 and previous config saved to /var/cache/conftool/dbconfig/20220803-134030-marostegui.json [13:41:53] (03CR) 10Filippo Giunchedi: [C: 03+2] klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:42:25] this is a part of the infra I never got too deep into sadly [13:42:26] hmm...but etcd should be replicated anyways.
Just the clients not talking to codfw etcd AIUI [13:43:22] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:43:56] ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕤☕ grep mw2259 /etc/dsh/group/* [13:43:58] /etc/dsh/group/jobrunner:mw2259.codfw.wmnet [13:43:59] (03PS1) 10Vgutierrez: trafficserver: Adjust (Total|Active)PluginTime milestones [puppet] - 10https://gerrit.wikimedia.org/r/820132 (https://phabricator.wikimedia.org/T309651) [13:44:00] /etc/dsh/group/mediawiki-installation:mw2259.codfw.wmnet [13:44:03] /etc/dsh/group/scap-proxies:mw2259.codfw.wmnet [13:44:04] /etc/dsh/group/scap_targets:mw2259.codfw.wmnet [13:44:38] confd is producing a bunch of errors on deploy1002 [13:44:39] (03CR) 10EllenR: [C: 03+1] "LGTM, but I would never have guessed that you would put click event in InitializeSettings-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) (owner: 10Eigyan) [13:44:41] Aug 3 13:43:57 deploy1002 confd[32274]: 2022-08-03T13:43:57Z deploy1002 /usr/bin/confd[32274]: ERROR client: etcd cluster is unavailable or misconfigured; error #0: dial tcp: lookup conf1004.eqiad.wmnet on 10.3.0.1:53: no such host [13:44:44] Aug 3 13:43:57 deploy1002 confd[32274]: ; error #1: dial tcp: lookup conf1006.eqiad.wmnet on 10.3.0.1:53: no such host [13:44:45] Aug 3 13:43:57 deploy1002 confd[32274]: ; error #2: dial tcp: lookup conf1005.eqiad.wmnet on 10.3.0.1:53: no such host [13:45:57] hmm conf1005 and conf1006 have been decomm'ed per https://phabricator.wikimedia.org/T311408 [13:46:29] new ones are conf100[789] [13:46:41] maybe confd needs a restart to pick up changes to the srv record? 
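[Editor's note] The suspected failure mode here — a long-running watcher resolved its etcd endpoints from the SRV record once at startup and kept dialing the decommissioned conf100[456] hosts — can be sketched like this. The dict stands in for DNS; this is an illustration of the caching behaviour, not confd's actual implementation:

```python
# Sketch of the stale-backend failure mode: a long-running watcher resolves
# its etcd endpoints from a DNS SRV record once at startup and keeps dialing
# them until restarted. The dict below stands in for real DNS.

dns_srv = {"_etcd._tcp.eqiad.wmnet": ["conf1004", "conf1005", "conf1006"]}

class Watcher:
    def __init__(self, srv_name):
        self.srv_name = srv_name
        # Resolved once at startup, then cached for the life of the process.
        self.backends = list(dns_srv[srv_name])

    def restart(self):
        # Only a restart re-resolves the SRV record.
        self.backends = list(dns_srv[self.srv_name])

w = Watcher("_etcd._tcp.eqiad.wmnet")
# conf1004-6 are decommissioned and replaced by conf1007-9:
dns_srv["_etcd._tcp.eqiad.wmnet"] = ["conf1007", "conf1008", "conf1009"]
print(w.backends)  # still the old, now-nonexistent hosts
w.restart()
print(w.backends)  # picks up the new hosts
```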
[13:46:43] yeah and the SRV record looks correct [13:46:48] that would be really asinine but I'm going to do it [13:46:58] !log ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ sudo systemctl restart confd [13:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:20] heh [13:47:26] on restart it did update everything [13:47:28] that worked? [13:47:30] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Adjust (Total|Active)PluginTime milestones [puppet] - 10https://gerrit.wikimedia.org/r/820132 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:47:33] ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ grep mw2259 /etc/dsh/group/* [13:47:35] /etc/dsh/group/scap-proxies:mw2259.codfw.wmnet [13:47:38] /etc/dsh/group/scap_targets:mw2259.codfw.wmnet [13:47:39] no longer in jobrunners [13:48:15] jayme: so it looks like we need to do a global restart of confd too [13:48:33] old confctl data in non obvious was is not the first time it hit us- I think something weird happened on a dc switchover too [13:48:41] *non-obvious ways [13:48:48] and probably uh [13:48:58] figure out how to monitor for confd being stuck in an unhealthy state for *days* [13:49:05] +1 [13:49:15] both on etcd and on client side [13:50:17] (03PS1) 10Filippo Giunchedi: klaxon: fix escalate_unpaged usage [puppet] - 10https://gerrit.wikimedia.org/r/820135 (https://phabricator.wikimedia.org/T313603) [13:50:33] cdanis: ^ [13:50:53] (03CR) 10CDanis: [C: 03+2] klaxon: fix escalate_unpaged usage [puppet] - 10https://gerrit.wikimedia.org/r/820135 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:53:49] does anyone object to me restarting confd across the fleet? 
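[Editor's note] A simple client-side check of the kind proposed above — flag confd as stuck when a file it manages has not been rewritten for days — could look like the sketch below. The path and threshold are illustrative; this is not an existing WMF check:

```python
# Sketch of a staleness probe for confd-managed files: if the rendered
# file's mtime is older than a threshold, the watcher is likely stuck.
import os
import time

def is_stale(path, max_age_seconds):
    """True if `path` was last rewritten more than `max_age_seconds` ago."""
    return time.time() - os.path.getmtime(path) > max_age_seconds

# Hypothetical usage: alert if the jobrunner dsh group is 3+ days old.
# is_stale("/etc/dsh/group/jobrunner", 3 * 86400)
```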
[13:54:07] RECOVERY - puppetmaster backend https on puppetmaster1004 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.447 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:54:23] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:54:46] cdanis: thanks for looking & taking care - I have none [13:54:50] ^ cdanis let's wait to see what that is [13:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32231 and previous config saved to /var/cache/conftool/dbconfig/20220803-135536-marostegui.json [13:56:22] I see an increase in latency, but not a recent one: https://grafana.wikimedia.org/goto/BH-MUfk4z?orgId=1 [13:56:59] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:57:13] heh [13:57:21] I'm going to proceed [13:57:39] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'P{R:Class = Confd}' 'systemctl restart confd' [13:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] done [13:58:49] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11345 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:00:13] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [14:01:31] (03PS1) 10Cwhite: scap: add option to selectivlely disable bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) [14:02:24] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [14:02:43] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:04:02] (03PS2) 10Cwhite: scap: add option to selectivlely disable bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) [14:04:13] (03PS1) 10Ladsgroup: auto_schema: Start depooling codfw replicas [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) [14:04:19] ah, I just noticed it is codfw only [14:04:22] sorry [14:06:34] !log installing freetype security updates on bullseye [14:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32232 and previous config saved to /var/cache/conftool/dbconfig/20220803-141042-marostegui.json [14:10:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1109.eqiad.wmnet with reason: Maintenance [14:10:46] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:10:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [14:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32233 and previous config saved to /var/cache/conftool/dbconfig/20220803-141103-marostegui.json [14:11:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:12:29] (03PS1) 10Jbond: C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) [14:12:31] (03CR) 10Marostegui: [C: 03+1] site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup) [14:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32234 and previous config saved to /var/cache/conftool/dbconfig/20220803-141310-marostegui.json [14:13:37] (03CR) 10CI reject: [V: 04-1] C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:14:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36609/console" [puppet] - 
10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:16:03] (03PS1) 10CDanis: Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) [14:16:51] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:18:11] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [14:18:33] urbanecm: again thanks very much for pointing that out earlier, I filed our own task & a bug upstream against confd [14:19:10] cdanis: thanks for figuring it out so quickly! [14:19:27] (03PS2) 10Jbond: C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) [14:19:32] apparently this is a known issue but not a well-documented or well-known-enough issue :? [14:19:35] :/ [14:20:41] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:20:50] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael) Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata? Though I'm not very familiar with this particular part... 
[14:20:52] (03CR) 10CDanis: [C: 03+2] Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:22:50] (03CR) 10Jbond: "couple more minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:25:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:27:14] !log upgrading ganeti/esams to Ganeti 3.0.2 T312637 [14:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [14:28:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32235 and previous config saved to /var/cache/conftool/dbconfig/20220803-142816-marostegui.json [14:28:21] !log power off thumbor2003 and thumbor2004 [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:29] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:36] (03CR) 10Jbond: Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:29:51] PROBLEM - Host thumbor2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:57] PROBLEM - Host conf2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:06] this is fine [14:30:44] (03CR) 10CDanis: [C: 03+2] Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:31:08] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap [14:31:21] 
!log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work [14:31:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap [14:31:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work [14:31:45] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=664eda2d-5203-44ca-92c1-3213c3996b5f) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and t... [14:31:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36610/console" [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:32:08] !log shutdown aqs200[5-8] prior to PDU work T310070 [14:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:11] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [14:33:39] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:33:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070 [14:34:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070 [14:34:29] PROBLEM - Host mw2314.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2315.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host 
mw2317.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2316.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2318.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:30] PROBLEM - Host mw2319.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:30] PROBLEM - Host mw2321.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:31] PROBLEM - Host mw2320.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:35:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:35:47] PROBLEM - Host mw2324.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:36:27] PROBLEM - Host mw2311.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:11] PROBLEM - Host mw2322.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:12] PROBLEM - Host mw2323.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:03] (03CR) 10Jbond: Add an option to use the PKI for etcd intra-cluster certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:38:14] (03CR) 10Jbond: [C: 03+1] Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:39:24] (03PS3) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) [14:39:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:39:59] PROBLEM - 
Host mw2310.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:59] PROBLEM - Host mw2312.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:02] PROBLEM - Host mw2313.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:15] PROBLEM - Host ores2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:27] PROBLEM - Host restbase2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10fgiunchedi) a:05herron→03RobH Hi @robh, re: racking since this is an expansion please allocate to new rows (compared to existing kafka-logging... [14:42:57] PROBLEM - Host mw2259.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2261.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2262.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2260.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2264.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:58] PROBLEM - Host mw2263.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:58] PROBLEM - Host mw2266.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:59] PROBLEM - Host mw2265.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:59] PROBLEM - Host mw2268.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:00] PROBLEM - Host mw2267.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:00] PROBLEM - Host mw2269.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:07] PROBLEM - Host db2123.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:18] do we not downtime mgmt? 
heh [14:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32236 and previous config saved to /var/cache/conftool/dbconfig/20220803-144322-marostegui.json [14:44:31] PROBLEM - Host db2108.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:42] PROBLEM - Host es2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:09] RECOVERY - puppetmaster backend https on puppetmaster2004 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.522 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:45:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:45:33] PROBLEM - Host conf2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:33] PROBLEM - Host mw2270.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:46:00] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:02] (03PS1) 10Jbond: puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 [14:46:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:46:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:46:14] (03CR) 10Jbond: [C: 03+2] puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 (owner: 10Jbond) [14:46:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 (owner: 10Jbond) [14:46:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:47:21] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL 
- Packet loss = 100% [14:47:29] PROBLEM - Host ores2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:31] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.21 (duration: 06m 13s) [14:48:45] PROBLEM - Host thumbor2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:13] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:49:21] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:50:02] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:37] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7169 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:51:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:52:37] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:37] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:51] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:53:25] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.19 (duration: 05m 37s) [14:53:59] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:09] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:55:53] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [14:55:57] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10lmata) p:05Triage→03Medium [14:56:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:56:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:56:09] RECOVERY - Check systemd state on search-loader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:15] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:56:43] RECOVERY - restbase endpoints health on restbase2023 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:58:05] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) p:05Medium→03High [14:58:13] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) 05Open→03Resolved [14:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32237 and previous config saved to /var/cache/conftool/dbconfig/20220803-145828-marostegui.json [14:58:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:58:32] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:58:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:58:49] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32238 and previous config saved to /var/cache/conftool/dbconfig/20220803-145849-marostegui.json [14:59:50] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap [14:59:52] !log jayme@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap [14:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32239 and previous config saved to /var/cache/conftool/dbconfig/20220803-145956-marostegui.json [15:00:38] (03CR) 10Cwhite: "This is only one option. I've also seen it "disabled" by setting scap::deployment_server to $fqdn." [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:01:19] RECOVERY - Host thumbor2003.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 501.02 ms [15:01:21] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:01:21] RECOVERY - Host mw2259.mgmt is UP: PING OK - Packet loss = 0%, RTA = 60.35 ms [15:01:21] RECOVERY - Host mw2260.mgmt is UP: PING OK - Packet loss = 0%, RTA = 55.06 ms [15:01:21] RECOVERY - Host mw2261.mgmt is UP: PING OK - Packet loss = 0%, RTA = 55.12 ms [15:01:22] RECOVERY - Host mw2262.mgmt is UP: PING OK - Packet loss = 0%, RTA = 52.35 ms [15:01:22] RECOVERY - Host mw2264.mgmt is UP: PING OK - Packet loss = 0%, RTA = 54.46 ms [15:01:22] RECOVERY - Host mw2263.mgmt is UP: PING OK - Packet loss = 0%, RTA = 52.39 ms [15:01:23] RECOVERY - Host mw2266.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.71 ms [15:01:23] RECOVERY - Host mw2265.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.53 ms [15:01:24] RECOVERY - Host mw2268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.33 ms [15:01:24] RECOVERY - Host mw2269.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.75 ms [15:01:25] RECOVERY - Host mw2267.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.81 ms [15:02:25] RECOVERY - Host ores2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:03:07] RECOVERY - Host es2021.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 45.24 ms [15:03:11] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7340 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:04:00] !log power off mc2023 [15:04:01] RECOVERY - Host conf2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [15:05:41] RECOVERY - Host conf2004 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [15:06:27] RECOVERY - Host ores2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [15:07:01] RECOVERY - Host thumbor2003 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:07:07] RECOVERY - Host mw2311.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.53 ms [15:07:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:07:27] RECOVERY - Host restbase2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [15:07:37] RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:08:01] RECOVERY - Host db2123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:08:39] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) >>! In T138093#8127054, @Krinkle wrote: > Quick drive-by note here, feel free to ignore if a false alarm. MediaWiki has a concept o... 
[15:09:27] RECOVERY - Host db2108.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:10:01] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet [15:10:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet [15:10:07] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7402 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:10:12] jouncebot: nowandnext [15:10:13] No deployments scheduled for the next 2 hour(s) and 49 minute(s) [15:10:13] In 2 hour(s) and 49 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800) [15:10:13] In 2 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800) [15:10:13] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:10:21] (03CR) 10Urbanecm: [C: 03+2] ServiceImageRecommendationProvider: Add extra logging when no JSON response received [extensions/GrowthExperiments] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819075 (https://phabricator.wikimedia.org/T313973) (owner: 10Urbanecm) [15:10:27] win 44 [15:10:34] lose 666 [15:10:35] RECOVERY - Host mw2315.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [15:10:35] RECOVERY - Host mw2314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:10:35] RECOVERY - Host mw2317.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [15:10:35] RECOVERY - Host mw2318.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms [15:10:35] RECOVERY - Host mw2316.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.65 ms [15:10:39] RECOVERY - Host mw2310.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 36.36 ms [15:10:39] RECOVERY - Host mw2312.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [15:10:41] RECOVERY - Host mw2313.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [15:11:06] (03CR) 10Thcipriani: [C: 03+1] gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:11:29] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:11:40] (03CR) 10Ahmon Dancy: [C: 03+1] "I have no objection to this approach. I'm okay with merging it today for expediency. We can change it later if needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:12:53] RECOVERY - Host mw2324.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.14 ms [15:13:23] (03CR) 10Thcipriani: [C: 03+1] "looks right from the scap side anyway 😐" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [15:14:15] RECOVERY - Host mw2322.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [15:14:15] RECOVERY - Host mw2323.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [15:15:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32240 and previous config saved to /var/cache/conftool/dbconfig/20220803-151502-marostegui.json [15:16:25] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:17:03] RECOVERY - Host mw2321.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [15:17:03] RECOVERY - Host mw2319.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:17:03] RECOVERY - Host mw2320.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [15:17:14] (03PS1) 10JMeybohm: Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 [15:17:27] (03CR) 10CI reject: [V: 04-1] Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 (owner: 10JMeybohm) [15:18:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2002 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [15:19:01] !log bking@cumin1001 
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070 [15:19:04] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [15:19:15] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070 [15:19:34] (03PS2) 10JMeybohm: Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 [15:20:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [15:21:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet [15:21:55] (03CR) 10JMeybohm: [C: 03+2] Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 (owner: 10JMeybohm) [15:22:23] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [15:23:21] RECOVERY - Host mw2270.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:24:37] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors [15:24:40] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors [15:24:41] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors [15:24:44] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors [15:24:45] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors [15:24:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors [15:25:22] TIL sre.dns.wipe-cache :) [15:26:27] RECOVERY - MegaRAID on ms-be2067 is OK: OK: optimal, 23 logical, 23 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:26:43] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:27:29] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:27:41] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32241 and previous config saved to /var/cache/conftool/dbconfig/20220803-153009-marostegui.json [15:30:33] PROBLEM - Host wcqs2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:54] !log clearing ats-be cache on cp6016 - T309651 [15:30:55] (03CR) 10Cwhite: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:56] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [15:31:21] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:31:51] (03Merged) 10jenkins-bot: ServiceImageRecommendationProvider: Add extra logging when no JSON response received [extensions/GrowthExperiments] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819075 (https://phabricator.wikimedia.org/T313973) (owner: 10Urbanecm) [15:32:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2024.codfw.wmnet [15:32:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance [15:32:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance [15:32:45] PROBLEM - Host mc2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:33:03] PROBLEM - Host wcqs2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:33:05] (03CR) 10BCornwall: [C: 03+2] geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall) [15:33:39] (03PS4) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [15:34:11] PROBLEM - Host aqs2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:12] PROBLEM - Host aqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:12] PROBLEM - Host aqs2008.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [15:34:13] PROBLEM - Host aqs2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [15:34:53] PROBLEM - Host db2161.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:53] PROBLEM - Host db2162.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:07] PROBLEM - Host db2096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance [15:35:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance [15:35:33] PROBLEM - Host rdb2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:07] (03CR) 10Jaime Nuche: [C: 03+1] "Looks good, can you point to the puppet config you're using to config those scap hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:36:17] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:36:22] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: 4438957e78e0012aff646e52dc16a4fb796cfd6b: ServiceImageRecommendationProvider: Add extra logging when no JSON response received (T313973) (duration: 03m 04s) [15:36:25] T313973: Exception: Invalid JSON response for page: Espejo - https://phabricator.wikimedia.org/T313973 [15:36:32] * urbanecm is done [15:36:53] !log powercycle kafka-logging2003 - not responsive to serial console [15:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:58] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:37:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:37:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:38:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:38:15] PROBLEM - Host mw2331.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:41] PROBLEM - Host restbase2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:38:45] PROBLEM - Host kubernetes2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:45] PROBLEM - Host kubernetes2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:50] !log clearing ats-be cache on cp6008 - T309651 [15:38:51] PROBLEM - Host kubernetes2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:53] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [15:38:55] PROBLEM - Host maps2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:08] (03CR) 10Cwhite: [C: 03+2] scap: add option to selectivlely disable bootstrapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:39:13] PROBLEM - Host ml-serve2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host mw2326.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host mw2327.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2328.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2332.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2325.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[15:39:52] PROBLEM - Host mw2329.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2330.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:53] PROBLEM - Host mw2333.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:53] PROBLEM - Host mw2334.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:20] (03PS1) 10Jbond: Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) [15:40:33] PROBLEM - Host db2124.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:41] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [15:40:47] (03CR) 10Jbond: Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [15:40:53] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:01] (03CR) 10Jbond: [C: 03+2] Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) (owner: 10Jbond) [15:41:03] PROBLEM - Host db2134.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:27] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) (owner: 10Jbond) [15:41:33] PROBLEM - Host db2098.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:39] PROBLEM - Host db2111.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:39] PROBLEM - Host db2110.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:45] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [15:41:47] PROBLEM - Host dbproxy2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:42:47] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:42:57] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:44:05] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:59] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:45:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32242 and previous config saved to /var/cache/conftool/dbconfig/20220803-154515-marostegui.json [15:45:21] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:45:43] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:47] PROBLEM - tileratorui on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [15:45:57] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Slst2020) 05In progress→03Resolved [15:46:49] PROBLEM - tilerator on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator 
[15:48:51] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [15:51:23] (03PS5) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [15:51:29] RECOVERY - tilerator on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [15:52:09] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:52:45] !log pooling mw2259-2270 again [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:27] RECOVERY - Host db2124.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.94 ms [15:53:53] RECOVERY - Host db2134.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.32 ms [15:54:21] RECOVERY - Host db2098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:54:29] RECOVERY - Host db2111.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms [15:54:29] RECOVERY - Host db2110.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [15:54:37] RECOVERY - Host dbproxy2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:55:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10hnowlan) [15:57:43] RECOVERY - Host mw2333.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [15:58:09] RECOVERY - Host kubernetes2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 188.24 ms [15:58:09] RECOVERY - Host kubernetes2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 187.74 ms [15:58:16] !log mvernon@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work [15:58:17] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:21] RECOVERY - Host maps2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms [15:58:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work [15:58:37] RECOVERY - Host ml-serve2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.93 ms [15:58:42] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=353f1e46-07cd-47d6-9a06-44c3a93b5b51) set by mvernon@cumin1001 for 1 day, 0:00:00 on 3 host(s) and t... [15:59:09] !log shutdown ms-be20[33,47],thanos-be2002 prior to PDU work T310070 [15:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [15:59:15] RECOVERY - Host mw2326.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [15:59:15] RECOVERY - Host mw2327.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [15:59:15] RECOVERY - Host mw2328.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [15:59:15] RECOVERY - Host mw2332.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [15:59:15] RECOVERY - Host mw2331.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:59:16] RECOVERY - Host mw2325.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:59:16] RECOVERY - Host mw2330.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:59:17] RECOVERY - Host mw2329.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:59:17] RECOVERY - 
Host mw2334.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:59:29] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:39] RECOVERY - tileratorui on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:00:09] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) @ERayfield Thank you for the update. Hope the get well soon. I think the easiest way is we close this now and then later you can simply click reopen on this existing tick... [16:00:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:51] RECOVERY - Host db2096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.30 ms [16:01:19] RECOVERY - Host rdb2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.95 ms [16:01:41] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) 05In progress→03Declined Don't take the "declined" too literal. The expection is that you just change it to "open" again whenever you like. 
[16:02:29] RECOVERY - Host wcqs2001 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [16:03:21] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:03:39] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:04:35] RECOVERY - Host restbase2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [16:04:45] RECOVERY - Host kubernetes2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:04:58] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10ERayfield) Ok, thanks - that sounds good to me Ellen [16:05:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:11] RECOVERY - Host mc2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [16:05:25] RECOVERY - Host wcqs2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [16:05:37] RECOVERY - Host aqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [16:06:27] RECOVERY - Host aqs2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:06:27] RECOVERY - Host aqs2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:06:28] RECOVERY - Host aqs2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.34 ms [16:07:05] RECOVERY - Host db2161.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [16:07:07] RECOVERY - Host db2162.mgmt is UP: PING OK - Packet loss = 
0%, RTA = 33.69 ms [16:07:41] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:08:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs[2005-2008].codfw.wmnet [16:08:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs[2005-2008].codfw.wmnet [16:08:08] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 15 hosts [16:08:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts [16:10:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:07] PROBLEM - Host ms-be2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:11:27] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for 12 hosts [16:11:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts [16:14:17] PROBLEM - IPMI Sensor Status on mw2322 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:15:11] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:15:59] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:16:25] PROBLEM - Host furud.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:11] PROBLEM - Host thanos-be2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:19] RECOVERY - 
SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:43] PROBLEM - Host elastic2080.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:55] PROBLEM - Host mc2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:17] PROBLEM - Host elastic2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:17] PROBLEM - Host elastic2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:21] PROBLEM - Host elastic2079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:24:40] (03PS1) 10Milimetric: role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820160 [16:25:12] (03CR) 10Milimetric: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend" [puppet] - 10https://gerrit.wikimedia.org/r/820160 (owner: 10Milimetric) [16:25:46] (03CR) 10Btullis: [C: 03+2] role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820160 (owner: 10Milimetric) [16:26:55] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:27:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10RobH) a:05RobH→03Jclark-ctr [16:27:41] RECOVERY - Host elastic2044.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 40.55 ms [16:27:58] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes[2009-2010,2020].codfw.wmnet [16:27:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2009-2010,2020].codfw.wmnet [16:28:35] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[16:28:53] PROBLEM - IPMI Sensor Status on wcqs2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:30:05] RECOVERY - Host thanos-be2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [16:30:21] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.58 ms [16:30:33] RECOVERY - Host ms-be2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [16:30:34] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for rdb2008.codfw.wmnet [16:30:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for rdb2008.codfw.wmnet [16:32:01] !log power off mc2025-2026 [16:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:15] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7285 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:33:49] RECOVERY - Host mc2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [16:34:11] RECOVERY - Host elastic2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:34:13] RECOVERY - Host elastic2079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [16:34:13] RECOVERY - Host elastic2080.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:34:33] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:34:39] RECOVERY - Host furud.mgmt is UP: PING 
OK - Packet loss = 0%, RTA = 45.05 ms [16:34:39] PROBLEM - Host mc2025 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:39] PROBLEM - Host mc2026 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:14] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap [16:35:17] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [16:35:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap [16:36:53] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Dzahn) "flapping mgmt" in Icinga has been reported as succesfully fixed on other hosts through either "reset DRAC" and/or "upgrade DRAC / firmware". also see: T304289 (maybe this sh... [16:37:42] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap [16:37:49] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Dzahn) [16:37:51] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Dzahn) [16:37:56] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap [16:38:34] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2023.codfw.wmnet [16:38:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2023.codfw.wmnet [16:39:20] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 10 hosts [16:39:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 10 hosts [16:40:12] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2046.codfw.wmnet [16:40:12] !log jayme@cumin1001 END 
(PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2046.codfw.wmnet [16:42:15] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1044 site=eqiad tunnel=mc2026_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:42:27] PROBLEM - Host db2164.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:46:01] PROBLEM - Host mc2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:46:01] PROBLEM - Host mc2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:46:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet [16:46:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet [16:46:57] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:47:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping [16:47:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work [16:47:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping [16:47:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work [16:47:49] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: 
instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:47:50] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0444b6dc-d394-43ed-8847-01dae0f308ee) set by mvernon@cumin1001 for 1 day, 0:00:00 on 8 host(s) and their services with rea... [16:48:09] PROBLEM - Host elastic2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:09] PROBLEM - Host gitlab-runner2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:11] PROBLEM - Host es2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:11] PROBLEM - Host es2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:20] !log shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work T310145 [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:23] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [16:49:03] PROBLEM - Host kubestage2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:50:51] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:15] PROBLEM - Host db2148.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:19] PROBLEM - Host db2163.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:27] PROBLEM - Host parse2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:29] PROBLEM - Host parse2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:31] PROBLEM - Host ganeti2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:31] PROBLEM - Host ganeti2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:01] PROBLEM - Host es2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:09] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% 
[16:53:05] PROBLEM - Host elastic2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:01] PROBLEM - Host parse2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:45] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:54:51] PROBLEM - Host restbase2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:03] PROBLEM - Host wdqs2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:34] (03PS1) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713)
[16:58:46] (03PS1) 10Jgreen: Remove frauth1001 from Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/820164 (https://phabricator.wikimedia.org/T299068)
[16:59:56] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36611/" [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[16:59:58] (03PS1) 10Ladsgroup: auto_schema: Change replica_set to be all replicas in all dcs [software] - 10https://gerrit.wikimedia.org/r/820165 (https://phabricator.wikimedia.org/T314486)
[17:00:15] RECOVERY - IPMI Sensor Status on wcqs2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:00:29] (03PS6) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380
[17:00:37] !log btullis@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[17:01:21] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7398 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:03:13] (03PS1) 10Btullis: Revert the AQS mediawiki history change [puppet] - 10https://gerrit.wikimedia.org/r/820167
[17:04:49] PROBLEM - Host elastic2031 is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:31] (03CR) 10Btullis: [C: 03+2] Revert the AQS mediawiki history change [puppet] - 10https://gerrit.wikimedia.org/r/820167 (owner: 10Btullis)
[17:06:55] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=(kubernetes2020.codfw.wmnet|kubernetes2009.codfw.wmnet|kubernetes2010.codfw.wmnet)
[17:07:55] PROBLEM - IPMI Sensor Status on ganeti2019 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:08:07] PROBLEM - Host gitlab-runner2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:25] !log T310145 `elastic2031` and `wcqs2002` powered off in preparation for C1 maintenance
[17:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:28] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[17:08:49] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:09:21] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10RKemper)
[17:10:15] PROBLEM - Host wcqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:55] RECOVERY - Host parse2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms
[17:10:55] RECOVERY - Host parse2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms
[17:10:57] RECOVERY - Host ganeti2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.99 ms
[17:10:57] RECOVERY - Host ganeti2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.26 ms
[17:11:27] RECOVERY - Host es2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.88 ms
[17:12:29] RECOVERY - Host elastic2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[17:13:29] RECOVERY - Host parse2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[17:14:15] RECOVERY - Host restbase2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[17:14:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for 6 hosts
[17:14:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
[17:14:33] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms
[17:14:47] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:15:31] RECOVERY - Host wdqs2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[17:17:13] RECOVERY - Host db2148.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[17:17:19] RECOVERY - Host db2163.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[17:18:31] RECOVERY - Host mc2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[17:18:31] RECOVERY - Host mc2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[17:18:59] PROBLEM - IPMI Sensor Status on ganeti2020 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:19:29] RECOVERY - Host mc2025 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms
[17:20:35] RECOVERY - Host gitlab-runner2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.96 ms
[17:20:35] RECOVERY - Host elastic2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[17:20:41] RECOVERY - Host es2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.62 ms
[17:20:41] RECOVERY - Host es2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 50.99 ms
[17:21:31] RECOVERY - Host db2164.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.55 ms
[17:21:33] RECOVERY - Host kubestage2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.58 ms
[17:23:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase20[12]4.codfw.wmnet
[17:23:37] RECOVERY - Host gitlab-runner2002 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms
[17:24:01] RECOVERY - Host mc2026 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[17:24:20] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10hnowlan)
[17:24:29] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[17:25:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:26:17] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:28:05] PROBLEM - Host mc2027 is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:01] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[17:30:17] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:32:57] (Traffic bill over quota) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:33:11] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:33:33] PROBLEM - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:33:36] ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314509 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat
[17:33:40] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10ops-monitoring-bot)
[17:36:39] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7074 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:37:39] PROBLEM - Host es2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:45] PROBLEM - Host kubernetes2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:58] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:38:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[2008-2010].codfw.wmnet
[17:38:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2008-2010].codfw.wmnet
[17:39:15] RECOVERY - IPMI Sensor Status on ganeti2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:39:19] PROBLEM - Host es2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:29] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:39:37] PROBLEM - Host db2138.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:39] PROBLEM - Host kubernetes2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:40:41] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7104 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:41:35] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:42:49] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:06] (03CR) 10Andrew Bogott: [C: 03+2] extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook)
[17:43:23] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:31] PROBLEM - Host mc2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:21] PROBLEM - Host cumin2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:37] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:54] (03PS1) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[17:46:21] PROBLEM - Host ores2005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:46:33] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:46:41] PROBLEM - Host wcqs2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:47:49] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:49] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:55] PROBLEM - IPMI Sensor Status on ml-serve2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:48:27] (03CR) 10CI reject: [V: 04-1] phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[17:48:33] (03PS3) 10Ebernhardson: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - 10https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426)
[17:48:55] PROBLEM - Host db2149.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:17] RECOVERY - IPMI Sensor Status on ganeti2020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:50:32] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=kubestage2002.codfw.wmnet
[17:50:33] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:51:15] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:43] PROBLEM - Host pc2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:51] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 35.77 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[17:52:17] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:21] PROBLEM - Host restbase2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:31] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:58] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:53:56] (03CR) 10Cwhite: [C: 03+1] "Best to move this forward prior to the weekly index rollover." [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite)
[17:54:29] (03PS1) 10Ebernhardson: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426)
[17:55:05] (03PS1) 10Ebernhardson: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426)
[17:55:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2055.codfw.wmnet
[17:55:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2055.codfw.wmnet
[17:55:39] !log increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426
[17:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:47] T314426: Job queue for writes to cloudelastic falling behind - https://phabricator.wikimedia.org/T314426
[17:56:55] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for elastic2043.codfw.wmnet
[17:56:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2043.codfw.wmnet
[17:57:05] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for elastic2044.codfw.wmnet
[17:57:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2044.codfw.wmnet
[17:57:29] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:57:32] (03PS2) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[17:57:35] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:57:38] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2025-2026].codfw.wmnet
[17:57:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2025-2026].codfw.wmnet
[17:57:58] (Traffic bill over quota) resolved: Alert for device cr3-knams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:58:02] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubestage2002.codfw.wmnet
[17:58:03] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage2002.codfw.wmnet
[17:58:43] RECOVERY - Host restbase2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.72 ms
[17:58:45] RECOVERY - Host es2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms
[17:58:45] RECOVERY - Host es2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.13 ms
[17:59:03] RECOVERY - Host db2138.mgmt is UP: PING OK - Packet loss = 0%, RTA = 47.58 ms
[17:59:05] RECOVERY - Host kubernetes2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.48 ms
[17:59:05] RECOVERY - Host kubernetes2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 72.39 ms
[17:59:51] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[17:59:51] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms
[18:00:05] dancy and brennen: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800).
[18:00:05] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800).
[18:00:10] o/
[18:00:25] o/
[18:00:44] Preparing to press the button
[18:00:49] RECOVERY - Host db2149.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.84 ms
[18:00:49] RECOVERY - Host wcqs2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms
[18:01:05] (03CR) 10Jgreen: [C: 03+2] Remove frauth1001 from Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/820164 (https://phabricator.wikimedia.org/T299068) (owner: 10Jgreen)
[18:01:33] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms
[18:01:39] RECOVERY - Host mc2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.94 ms
[18:02:16] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10wiki_willy) a:03Cmjohnson
[18:02:33] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 108.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[18:03:13] (03CR) 10Dduvall: "Note I added a new script instead of modifying phab_deploy_finalize so that this action could occur independent of the latter and directly" [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[18:03:19] RECOVERY - Host pc2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[18:03:29] RECOVERY - Host cumin2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.12 ms
[18:03:45] RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[18:04:05] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms
[18:04:31] RECOVERY - Host elastic2031 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[18:04:59] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:05:09] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms
[18:05:11] RECOVERY - Host wcqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[18:05:19] RECOVERY - Host mc2027 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[18:06:17] RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.59 ms
[18:06:29] PROBLEM - IPMI Sensor Status on ganeti2010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:07:05] brennen, dancy: note that power maintenance in codfw continues, you should be fine to deploy but don't be surprised when icinga is noisy, and you have to squint a little closer to watch for alerts you do care about
[18:07:15] thx!
[18:07:25] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076)
[18:07:26] thanks rzl
[18:07:27] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:27] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot)
[18:07:29] The button has been pressed
[18:07:43] (in particular there are no app servers left today so scapping should be unaffected)
[18:07:57] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms
[18:08:34] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot)
[18:09:21] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms
[18:09:31] PROBLEM - IPMI Sensor Status on mc2027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:10:41] RECOVERY - Host ores2005 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms
[18:10:49] PROBLEM - ores on ores2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[18:11:27] PROBLEM - ores_workers_running on ores2005 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[18:12:05] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7128 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:12:11] RECOVERY - ores on ores2005 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[18:12:37] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23 refs T308076
[18:12:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:12:41] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076
[18:12:57] RECOVERY - ores_workers_running on ores2005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[18:14:07] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms
[18:14:58] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:15:25] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:15:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:16:15] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.23 refs T308076 (duration: 03m 37s)
[18:16:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:16:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:18:35] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11005 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:21:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:21:39] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:22:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:23:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:24:58] (KubernetesCalicoDown) resolved: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:25:03] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7054 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:26:01] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:28:03] PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:40] 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jgreen)
[18:31:46] RECOVERY - IPMI Sensor Status on mc2027 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:33:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2027,2037].codfw.wmnet
[18:33:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2027,2037].codfw.wmnet
[18:35:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[18:36:44] RECOVERY - IPMI Sensor Status on ganeti2010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:37:16] RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:38:08] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:38:26] (03PS2) 10Jcrespo: Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794
[18:42:51] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[18:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2159 db2143', diff saved to https://phabricator.wikimedia.org/P32243 and previous config saved to /var/cache/conftool/dbconfig/20220803-184432-marostegui.json
[18:45:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:45:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:45:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:45:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:45:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:45:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:46:00] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:46:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32244 and previous config saved to /var/cache/conftool/dbconfig/20220803-184603-marostegui.json
[18:46:08] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32245 and previous config saved to /var/cache/conftool/dbconfig/20220803-184816-marostegui.json
[18:51:28] Ci is taking 20 minutes to run
[18:51:58] normally it takes 2-3
[18:52:11] (03CR) 10Jcrespo: [C: 03+2] Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[18:55:08] (03PS1) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T314486)
[18:55:14] (03PS1) 10Cwhite: logstash: route k8s messages to k8s partition [puppet] - 10https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381)
[18:56:13] !log rzl@deploy1002 conftool action : set/pooled=yes; selector: name=kubernetes2011.codfw.wmnet
[18:56:29] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2011.codfw.wmnet
[18:56:29] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2011.codfw.wmnet
[18:56:39] (03PS2) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445)
[18:58:10] (03CR) 10Marostegui: "I am thinking....how should we start treating the codfw master? We'll no longer be able to run schema changes directly on it, but how's au" [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup)
[19:01:38] (03PS17) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:01:40] (03PS2) 10Jbond: peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:02:27] (03PS3) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:03:15] (03PS1) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950)
[19:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32246 and previous config saved to /var/cache/conftool/dbconfig/20220803-190321-marostegui.json
[19:04:25] (03CR) 10Dzahn: "I know I recommended rsync::quickdatacopy myself but this was assuming it was a new thing. Since there is already the existing code here t" [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[19:04:56] (03PS2) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950)
[19:05:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew)
[19:08:07] (03CR) 10Dzahn: [C: 03+1] "thanks for this!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[19:10:12] (03PS1) 10Bartosz Dziewoński: Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192
[19:11:37] (03PS2) 10Brennen Bearnes: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953)
[19:12:05] (03CR) 10CI reject: [V: 04-1] auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: 10Ladsgroup)
[19:13:10] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7126 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:14:22] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7235 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:16:07] (03PS4) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:17:42] PROBLEM - Memcached on mc2038 is CRITICAL: connect to address 10.192.0.191 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[19:17:50] (03PS18) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:18:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32247 and previous config saved to /var/cache/conftool/dbconfig/20220803-191828-marostegui.json
[19:18:58] (03CR) 10Jbond: PeeringDB API: initial commit (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:19:39] (03Abandoned)
10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103 (owner: 10Jbond) [19:21:06] (03CR) 10Brennen Bearnes: [C: 03+1] "Seems right; https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/818231 could probably be combined with this." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:23:25] (03PS2) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [19:24:16] (03CR) 10CI reject: [V: 04-1] airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:25:09] !log gerrit1001 - rsyncing /var/lib/gerrit/review_site/ over to gerrit2002 815401 [19:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:35] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) 05Open→03Resolved Disk replaced [19:26:14] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [19:27:15] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) [19:33:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32248 and previous config saved to /var/cache/conftool/dbconfig/20220803-193334-marostegui.json [19:33:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:33:38] T312972: Rename index su_normalized on table spoofuser on wmf wikis - 
https://phabricator.wikimedia.org/T312972 [19:33:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:33:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32249 and previous config saved to /var/cache/conftool/dbconfig/20220803-193354-marostegui.json [19:35:31] (03PS3) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [19:36:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32250 and previous config saved to /var/cache/conftool/dbconfig/20220803-193607-marostegui.json [19:38:50] (03CR) 10Brennen Bearnes: "Sidebar to this patch, I just noticed:" [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:38:52] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078 [19:38:55] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [19:39:41] !log T314078 Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the `search-loader` instances: `sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id T314078` [19:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:04] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: 
/srv/cassandra/instance-data 7058 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:40:34] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:41] !log T314078 Forgot to mention, restart is at `ryankemper@cumin1001` tmux session `codfw_restarts` [19:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:34] (03CR) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:44:24] (03PS1) 10Dzahn: acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) [19:45:06] (03CR) 10Dzahn: [C: 03+2] acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:45:11] (03CR) 10Ahmon Dancy: [C: 03+1] acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:45:27] (03PS1) 10BryanDavis: striker: Remove uwsgi deployment logging config [puppet] - 10https://gerrit.wikimedia.org/r/820206 (https://phabricator.wikimedia.org/T306469) [19:45:31] (03PS1) 10BryanDavis: striker: remove from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/820207 (https://phabricator.wikimedia.org/T306469) [19:46:46] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms [19:48:27] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381) (owner: 10Cwhite) [19:49:20] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11633 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:51:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32251 and previous config saved to /var/cache/conftool/dbconfig/20220803-195113-marostegui.json [19:52:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:52:52] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7363 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:55:00] (03CR) 10Dduvall: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [19:55:42] (03PS5) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [19:55:44] (03CR) 10AOkoth: gitlab: copy ssh host keys for failover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [19:58:59] (03PS3) 10Dduvall: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [19:59:01] (03PS6) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 
(https://phabricator.wikimedia.org/T313950) [19:59:03] (03PS3) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950) [19:59:22] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:04] RoanKattouw, Urbanecm, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T2000). [20:00:04] zabe, danisztls, ebernhardson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] !log rzl@deploy1002 conftool action : set/pooled=yes; selector: name=kubernetes2012.codfw.wmnet [20:00:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2012.codfw.wmnet [20:00:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2012.codfw.wmnet [20:00:17] here ./ [20:00:20] \o [20:00:21] hi [20:00:26] I can deploy today [20:00:26] sry for not attending yesterday :) [20:00:34] no problem, it happens sometimes :) [20:00:38] hey :) [20:00:55] (03PS2) 10Urbanecm: Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:07] (03CR) 10Urbanecm: [C: 03+2] Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:14] hi zabe! 
let's start with you :) [20:01:22] (03CR) 10Urbanecm: [C: 03+2] Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson) [20:01:24] (03CR) 10Urbanecm: [C: 03+2] Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson) [20:01:33] (03CR) 10Urbanecm: [C: 03+2] Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192 (owner: 10Bartosz Dziewoński) [20:01:46] (03PS1) 10RLazarus: Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 [20:01:48] (03PS3) 10Urbanecm: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:02:48] (03Merged) 10jenkins-bot: Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:03:30] zabe: mwdebug1001 has your patch! can you check it please? [20:03:48] (03CR) 10CI reject: [V: 04-1] QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:03:52] !log cwhite@cumin2002 START - Cookbook sre.ganeti.makevm for new host logstash2032.codfw.wmnet [20:03:54] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [20:03:59] (I recall last time it had some issues, but I don't recall which ones. it'd be great to test it doesn't happen now) [20:04:08] danisztls: your patch seems to fail CI. Can you please fix it? 
[20:04:19] urbanecm: yes
[20:04:23] also, let me add you to the CI allowlist, so it runs automatically for you
[20:04:34] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11543 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:04:59] (03CR) 10RLazarus: [C: 03+2] Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 (owner: 10RLazarus)
[20:05:15] urbanecm, did https://test.wikipedia.org/w/index.php?title=Test&type=revision&diff=519601&oldid=519540 Could you check whether cuc_actor got the correct value?
[20:05:51] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36612/" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:05:54] (03CR) 10Dzahn: [C: 03+2] gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:06:01] zabe: on it
[20:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32252 and previous config saved to /var/cache/conftool/dbconfig/20220803-200619-marostegui.json
[20:06:38] urbanecm: thanks that will be super useful
[20:07:17] !log gerrit - adding second replica T313250
[20:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:21] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250
[20:07:52] zabe: looks like it works. can you verify? https://www.irccloud.com/pastebin/EzG2l1Fg/
[20:08:07] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:08:07] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2032.codfw.wmnet on all recursors
[20:08:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2032.codfw.wmnet on all recursors
[20:08:22] (03PS1) 10Dduvall: phabricator: Stop managing /srv/phab/repos [puppet] - 10https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953)
[20:08:32] urbanecm, yes looks good
[20:08:58] zabe: excellent, syncing
[20:08:58] danisztls: no problem. uploaded https://gerrit.wikimedia.org/r/820212 – once someone hits +2 on that, you'll be allowlisted. should be fairly quick :).
[20:09:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:09:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:09:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:10:10] (03Merged) 10jenkins-bot: Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 (owner: 10RLazarus)
[20:10:30] (03PS1) 10Bartosz Dziewoński: Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523)
[20:10:40] (03PS3) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445)
[20:10:44] (03PS4) 10DDesouza: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333)
[20:10:56] (03CR) 10Andrew Bogott: [C: 03+2] striker: Remove uwsgi deployment logging config [puppet] - 10https://gerrit.wikimedia.org/r/820206 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[20:11:16] MatmaRex: i see you're uploading some other backports, do you want me to ship them as well?
[20:11:20] i added one more patch to the window, i hope that's okay
[20:11:23] yes
[20:11:27] okay, let me +2 it
[20:11:31] jouncebot: now
[20:11:31] For the next 0 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T2000)
[20:11:33] funnily enough, the bug affects the Deployments page
[20:11:38] (03CR) 10Urbanecm: [C: 03+2] Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523) (owner: 10Bartosz Dziewoński)
[20:11:40] and i found it when adding my previous backports there
[20:11:50] heh
[20:12:18] urbanecm: should be good now
[20:12:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:12:30] danisztls: thanks, checking
[20:12:37] (03CR) 10Urbanecm: "rechecking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:12:40] well, letting CI to
[20:12:45] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:12:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 195f8090b9694be65c937cea108ff4f6400972ec: Start writing to cuc_actor on test wikis (T233004) (duration: 03m 27s)
[20:12:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:12:59] zabe: and your patch is live. please monitor the logs for a while :)
[20:13:01] (03CR) 10Andrew Bogott: [C: 03+2] "moderately confused that this file is called 'kubernetes.yaml' but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820207 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[20:13:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:13:14] sure, thanks :)
[20:13:21] no problem
[20:13:33] thanks for working on the migration!
[20:13:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:13:47] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[20:14:08] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[20:14:30] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: some yaml autoformatting [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro)
[20:15:22] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[20:15:53] (03PS5) 10Urbanecm: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:15:57] (03CR) 10Urbanecm: [C: 03+2] QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:16:15] (03PS1) 10Andrea Denisse: netmon: Open firewall port to connecto to the LibreNMS database. [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074)
[20:16:18] danisztls: CI approved it (and it will run automatically on your next commits), so let's deploy and see :)
[20:16:27] (will be available at beta within half an hour)
[20:16:35] urbanecm: thanks!
[20:17:19] np
[20:17:23] * urbanecm waits on CI now for the backports
[20:17:33] (03Merged) 10jenkins-bot: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:17:55] (03PS1) 10Ladsgroup: auto_schema: Treat master of all dcs like a master of active dc [software] - 10https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486)
[20:18:07] (03PS3) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[20:18:09] (03PS4) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950)
[20:18:11] (03PS4) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche)
[20:19:59] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[20:20:45] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[20:20:53] (03Merged) 10jenkins-bot: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson)
[20:20:56] (03Merged) 10jenkins-bot: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson)
[20:20:58] (03Merged) 10jenkins-bot: Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192 (owner: 10Bartosz Dziewoński)
[20:21:04] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:21:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32253 and previous config saved to /var/cache/conftool/dbconfig/20220803-202125-marostegui.json
[20:21:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[20:21:29] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[20:21:35] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:21:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[20:21:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32254 and previous config saved to /var/cache/conftool/dbconfig/20220803-202146-marostegui.json
[20:22:05] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:22:32] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking)
[20:22:35] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking)
[20:22:51] ebernhardson: your patches are at mwdebug1001 if they can be tested
[20:23:05] MatmaRex: your DiscussionTools patch is at mwdebug1001 as well, can you test it please?
[20:23:34] urbanecm: hmm, nothing to really test. it all happens inside job queue stuff
[20:23:38] looking
[20:23:43] ebernhardson: ack, i'll sync it then
[20:23:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:26] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:26:27] urbanecm: looks good
[20:26:35] MatmaRex: thanks, i'll sync it
[20:26:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32255 and previous config saved to /var/cache/conftool/dbconfig/20220803-202658-marostegui.json
[20:27:02] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[20:27:28] (03PS5) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950)
[20:27:36] (03PS5) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche)
[20:27:39] (03Merged) 10jenkins-bot: Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523) (owner: 10Bartosz Dziewoński)
[20:27:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:07] !log cwhite@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host logstash2032.codfw.wmnet
[20:28:15] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: 9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 23s)
[20:28:18] T314426: Job queue for writes to cloudelastic falling behind - https://phabricator.wikimedia.org/T314426
[20:28:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:29:28] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:30:32] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[20:31:28] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: 70a18f5846111a0dfe8ba473daf384cbb8e88804: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 13s)
[20:31:37] ebernhardson: your patch is live
[20:32:24] MatmaRex: your second patch is at mwdebug1001 now, please have a look (note wikitech doesn't understand X-Wikimedia-debug, so it has to be checked elsewhere)
[20:32:32] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[20:32:40] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-Z on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:04] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-Y on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:15] (03CR) 10Brennen Bearnes: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes)
[20:33:20] urbanecm: oh hmm, i'm not sure if it's reproducible anywhere else now
[20:33:28] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-Z on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:40] urbanecm: i'd need a private wiki on wmf.23
[20:33:44] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-X on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:34:40] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-Y on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:42] urbanecm: thanks
[20:34:50] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-X on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:34:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:35:02] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Open firewall port to connecto to the LibreNMS database. [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:35:24] oh, officewiki is on wmf.23
[20:35:35] (03CR) 10Brennen Bearnes: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes)
[20:35:42] (03PS1) 10Cwhite: install_server: add logstash2032 provisioning data [puppet] - 10https://gerrit.wikimedia.org/r/820218 (https://phabricator.wikimedia.org/T313408)
[20:35:44] (03PS1) 10Cwhite: hiera: add logstash2032 to cluster and set access [puppet] - 10https://gerrit.wikimedia.org/r/820219 (https://phabricator.wikimedia.org/T313408)
[20:35:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:36:02] but… i can't reproduce the original issue on officewiki
[20:36:25] urbanecm: so i think that due to some configuration i don't quite understand, wikitech might be the only affected wiki?
[20:36:40] wikitech, and my localhost :)
[20:36:40] MatmaRex: ah, okay. i'll just sync it to production then
[20:36:48] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/: b840eef86837aed3e566885110e93b2ca9ab5f42: Fix ReplyLinksController#teardown (duration: 03m 27s)
[20:37:02] (03PS1) 10Dduvall: devtools: Allow for scap deployment of scap [puppet] - 10https://gerrit.wikimedia.org/r/820220 (https://phabricator.wikimedia.org/T314195)
[20:37:19] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: Stop managing /srv/phab/repos [puppet] - 10https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall)
[20:38:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:39:51] !log urbanecm@deploy1002 sync-file aborted: a804fe18f1e14795ba7836d3ebf6c361bb1538a7: Update call to PageConfigFactory::create to use new signature (T314523) (duration: 00m 00s)
[21:01:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:02:10] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:24] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:02:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:04:40] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:08:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32258 and previous config saved to /var/cache/conftool/dbconfig/20220803-211216-marostegui.json [21:12:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:12:20] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [21:12:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32259 and previous config saved to /var/cache/conftool/dbconfig/20220803-211237-marostegui.json [21:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32260 and previous config saved to /var/cache/conftool/dbconfig/20220803-211449-marostegui.json [21:15:42] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:16:24] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:18:22] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7131 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:21:40] (CR) Ryan Kemper: [C: +2] Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson) [21:25:38] (Merged) jenkins-bot: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson) [21:26:00] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:27:28] (CR) Cwhite: [C: +2] hiera: add logstash2032 to cluster and set access [puppet] - https://gerrit.wikimedia.org/r/820219 (https://phabricator.wikimedia.org/T313408) (owner: Cwhite) [21:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32261 and previous config saved to /var/cache/conftool/dbconfig/20220803-212955-marostegui.json [21:30:04] !log ryankemper@deploy1002 helmfile [staging] START 
helmfile.d/services/changeprop: apply [21:30:06] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7260 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:30:42] !log ryankemper@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [21:32:26] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11515 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:35:51] !log ryankemper@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [21:35:54] !log ryankemper@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [21:36:19] ^ interesting, it doesn't seem to think there's changes in `eqiad` [21:37:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078 [21:37:43] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [21:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32262 and previous config saved to /var/cache/conftool/dbconfig/20220803-214501-marostegui.json [21:46:48] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:53:46] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:57:55] 
(LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32263 and previous config saved to /var/cache/conftool/dbconfig/20220803-220007-marostegui.json [22:00:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:00:11] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [22:00:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:00:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32264 and previous config saved to /var/cache/conftool/dbconfig/20220803-220057-marostegui.json [22:02:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32265 and previous config saved to /var/cache/conftool/dbconfig/20220803-220309-marostegui.json [22:08:14] SRE, Observability-Logging, 
vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (colewhite) Open→Resolved Host provisioned. [22:11:03] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:14:48] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) >>! In T279664#8123041, @MatthewVernon wrote: > Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs... [22:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32266 and previous config saved to /var/cache/conftool/dbconfig/20220803-221815-marostegui.json [22:22:24] SRE, Observability-Logging, vm-requests, SRE Observability (FY2022/2023-Q1): logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (lmata) [22:25:27] (PS1) BryanDavis: service::docker: Add SyslogIdentifier to systemd unit [puppet] - https://gerrit.wikimedia.org/r/820237 (https://phabricator.wikimedia.org/T306469) [22:25:29] (PS1) BryanDavis: striker: route syslog output to ELK cluster via kafka [puppet] - https://gerrit.wikimedia.org/r/820238 (https://phabricator.wikimedia.org/T306469) [22:26:39] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:27:51] SRE, Icinga, Observability-Alerting, SRE Observability (FY2022/2023-Q1): PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (lmata) [22:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32267 and previous config saved to /var/cache/conftool/dbconfig/20220803-223321-marostegui.json [22:37:09] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) Ok, a bit more context about the maxage issue @Krinkle pointed out, plus possible solutions: The CDN expiry code in MediaWiki is s... [22:37:15] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:37:57] ^ Krinkle / bblack [22:38:04] T138093 I mean, not the etcd alert :) [22:38:05] T138093: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 [22:43:21] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) I suppose there's also a fourth option: special-case the parameter 'title' so that it is always sorted into the first position. This w... 
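[editor's note] The query-parameter normalization being discussed in T138093 above (sort parameters, with 'title' special-cased into first position) could be sketched roughly as below. This is an illustrative sketch only, not MediaWiki's or Varnish's actual implementation; the function name and title-first rule are assumptions drawn from the discussion.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def normalize_query(url: str) -> str:
    """Sort query parameters alphabetically, but keep 'title' first.

    Illustrative sketch of the T138093 idea; not the production code.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Sort by key; 'title' is special-cased to sort before all other keys.
    params.sort(key=lambda kv: (kv[0] != "title", kv[0]))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With this rule, two URLs that differ only in parameter order normalize to the same cache key, e.g. `?action=raw&title=Foo` and `?title=Foo&action=raw` both become `?title=Foo&action=raw`.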
[22:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32268 and previous config saved to /var/cache/conftool/dbconfig/20220803-224827-marostegui.json [22:48:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:48:34] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [22:48:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:49:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:49:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:49:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance [22:49:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance [22:49:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:50:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json [22:50:53] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:55:49] PROBLEM 
- SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:55:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:08:14] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (BBlack) I think your "2 followed by 3" approach makes the most pragmatic sense. We could/should probably eventually revisit the idea of... [23:27:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:27] (CR) Papaul: [C: +2] Add new pdu model for pdu in rack b3 b6-b8 and c1 [puppet] - https://gerrit.wikimedia.org/r/820123 (https://phabricator.wikimedia.org/T310070) (owner: Papaul) [23:42:55] (Device rebooted) firing: Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:47:55] (Device rebooted) resolved: Device ps1-b3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:48:46] (Device rebooted) firing: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:49:14] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-Y on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-Y 83 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-Z on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-Z 315 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-Y on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-Y 44 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-X on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-Z on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-Z 263 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json [23:50:34] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Krinkle) >>! In T138093#8129915, @ori wrote: > […] > > **Option 2: Include both the sorted and unsorted forms in the canonical purge UR... 
[23:50:35] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [23:53:46] (Device rebooted) resolved: Device ps1-b6-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:55:00] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:49] (PS1) Krinkle: Remove redundant wgRC2UDPPrefix overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 [23:56:05] (PS2) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 [23:56:26] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:56:48] (CR) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle) [23:56:59] jouncebot: now [23:56:59] No deployments scheduled for the next 6 hour(s) and 3 minute(s) [23:57:05] that's what I thought [23:57:47] (Device rebooted) firing: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:59:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart