[00:00:04] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:04:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:10:35] (CR) Lucas Werkmeister (WMDE): [C: -1] Manage /etc/inputrc using Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819016 (https://phabricator.wikimedia.org/T293614) (owner: Lucas Werkmeister (WMDE))
[00:11:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:16:12] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:20:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:25:12] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-07-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:27:56] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:29:36] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:30:22] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:35:50] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:37:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:01:40] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-02 00:00:01 (3312 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:03:24] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[01:16:00] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:16:21] SRE, ops-codfw, Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[01:22:16] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[01:35:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:39:16] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:45] (JobUnavailable) firing: (10) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:10] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[01:50:45] (JobUnavailable) firing: (10) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:32] PROBLEM - Disk space on gitlab2002 is CRITICAL: DISK CRITICAL - free space: /srv/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab2002&var-datasource=codfw+prometheus/ops
[02:04:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:07] (PS1) Stang: tkwiki: Update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/819774 (https://phabricator.wikimedia.org/T314435)
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-Y on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-Y 189 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-X on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-X 273 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:54] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-Y on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-Y 161 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:56] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-X on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-X 295 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:08] RECOVERY - ps1-b2-codfw-infeed-load-tower-A-phase-Z on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-A-phase-Z 311 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:44] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-02 00:00:02 (3312 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:09:50] RECOVERY - ps1-b2-codfw-infeed-load-tower-B-phase-Z on ps1-b2-codfw is OK: SNMP OK - ps1-b2-codfw-infeed-load-tower-B-phase-Z 321 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:38] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[02:18:46] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:20:36] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:22:48] (Device rebooted) firing: Alert for device ps1-b4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:27:48] (Device rebooted) resolved: Device ps1-b4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:30:02] Puppet, Infrastructure-Foundations, Patch-For-Review, Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (ori) `enable-bracketed-paste` is on by default starting with Bash 5.1, which is the version in bullseye....
[02:32:56] (Device rebooted) firing: Alert for device ps1-b5-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:33:28] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:34:54] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-02 00:00:01 (3333 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:37:56] (Device rebooted) resolved: Device ps1-b5-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[02:47:09] SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[02:49:45] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[03:01:50] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-02 00:00:02 (3333 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:06:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:34:22] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[03:36:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:37:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:06] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[03:46:24] (CR) Andrea Denisse: librenms: Remove support for stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[03:50:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:57:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:04:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:11:04] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:14:12] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[04:20:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:36:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:50:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:55:36] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:00:16] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:00:32] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:05:42] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:42] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:09:58] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:19:56] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[05:21:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:30:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:32:14] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:32:56] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:33:24] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:33:48] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:34:14] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:48] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[05:36:26] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[05:38:09] re-paged from 24 hours ago, resolving in VO, no action needed
[05:38:13] yeah
[05:38:20] We must have forgotten to resolve it yesterday
[05:38:27] They can all be resolved
[05:38:39] I thought they would resolve automatically yesterday once the process got back up
[05:39:03] ✅
[05:39:29] thanks :*
[05:40:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:40:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:40:45] (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:41:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[05:41:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32187 and previous config saved to /var/cache/conftool/dbconfig/20220803-054106-marostegui.json
[05:41:09] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[05:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32188 and previous config saved to /var/cache/conftool/dbconfig/20220803-054526-marostegui.json
[06:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32189 and previous config saved to /var/cache/conftool/dbconfig/20220803-060032-marostegui.json
[06:04:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32190 and previous config saved to /var/cache/conftool/dbconfig/20220803-061538-marostegui.json
[06:17:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:36] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:23:29] (PS2) KartikMistry: CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208)
[06:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T312972)', diff saved to https://phabricator.wikimedia.org/P32191 and previous config saved to /var/cache/conftool/dbconfig/20220803-063045-marostegui.json
[06:30:49] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:30:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[06:31:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[06:31:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
[06:31:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
[06:31:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:31:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32192 and previous config saved to /var/cache/conftool/dbconfig/20220803-063148-marostegui.json
[06:35:58] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:36:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32193 and previous config saved to /var/cache/conftool/dbconfig/20220803-063656-marostegui.json
[06:37:00] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[06:37:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[06:38:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:38:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[06:39:08] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:39:24] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:59] !log power up centrallog2002 and prometheus2005 - T310070
[06:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:04] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[06:47:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:47:56] RECOVERY - Host prometheus2005 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms
[06:48:24] PROBLEM - puppet last run on prometheus2005 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:48:30] RECOVERY - Host centrallog2002 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms
[06:49:28] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:56] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:50:06] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:06] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:45] (JobUnavailable) firing: (5) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32194 and previous config saved to /var/cache/conftool/dbconfig/20220803-065202-marostegui.json
[06:52:14] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:27] SRE, Traffic: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (fgiunchedi)
[06:54:25] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:54:42] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[06:54:44] RECOVERY - puppet last run on prometheus2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:55:47] thanos rule is me, prometheus coming back
[06:56:46] !log grow sda/sdb 3 by 100G on thanos-be1002 - T314275
[06:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:49] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[06:56:55] !log grow sda/sdb 3 by 100G on thanos-be2003 - T314275
[06:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:29] SRE, Traffic: anycast-healthchecker fails to start after a reboot and before a puppet run - https://phabricator.wikimedia.org/T314457 (ayounsi) p:Triage→Low Agreed something needs to be fixed. The upside is that it works as a safeguard, preventing the service to receive live traffic before the fi...
[06:59:22] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:59:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T0700).
[07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] * kart_ is here and will self deploy..
[07:00:27] !log draining ganeti2011 T311686
[07:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:30] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[07:00:54] SRE-swift-storage: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (fgiunchedi)
[07:01:03] (CR) KartikMistry: [C: +2] CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208) (owner: KartikMistry)
[07:01:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[07:02:13] (Merged) jenkins-bot: CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208) (owner: KartikMistry)
[07:02:58] (CR) Giuseppe Lavagetto: [V: +1 C: +2] wancache: temporarily remove mc-gp2002 from the gutter pool [puppet] - https://gerrit.wikimedia.org/r/819634 (owner: Giuseppe Lavagetto)
[07:04:58] (CR) Giuseppe Lavagetto: [C: +2] mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) (owner: RLazarus)
[07:05:09] (PS2) Giuseppe Lavagetto: mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) (owner: RLazarus)
[07:05:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch
instance to plain disks, T311686 [07:05:35] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [07:05:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686 [07:06:42] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36601/console" [puppet] - 10https://gerrit.wikimedia.org/r/819697 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [07:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32195 and previous config saved to /var/cache/conftool/dbconfig/20220803-070708-marostegui.json [07:07:44] * kart_ Deploying after testing.. [07:09:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:09:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:18] (03PS1) 10Marostegui: Revert "mariadb: Disable notifications rack b5" [puppet] - 10https://gerrit.wikimedia.org/r/819791 [07:11:10] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819227|CX: Set MT threshold for publishing in Armenian WP to 80% (T313208)]] (duration: 03m 49s) [07:11:13] T313208: Adjust the threshold for Armenian to prevent publishing when overall unmodified content is higher than 80% - https://phabricator.wikimedia.org/T313208 [07:12:57] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Disable 
notifications rack b5" [puppet] - 10https://gerrit.wikimedia.org/r/819791 (owner: 10Marostegui) [07:16:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance [07:16:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:16:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance [07:16:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:16:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance [07:17:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance [07:17:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance [07:17:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance [07:17:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:18:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance [07:18:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance [07:19:09] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:19:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance [07:22:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312972)', diff saved to https://phabricator.wikimedia.org/P32196 and previous config saved to /var/cache/conftool/dbconfig/20220803-072214-marostegui.json [07:22:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:22:18] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [07:22:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [07:22:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:22:36] (03PS1) 10Giuseppe Lavagetto: redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 [07:22:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32197 and previous config saved to /var/cache/conftool/dbconfig/20220803-072253-marostegui.json [07:23:07] 
(03PS1) 10Marostegui: mariadb: Disable notifications on codfw racks [puppet] - 10https://gerrit.wikimedia.org/r/820066 (https://phabricator.wikimedia.org/T310070) [07:23:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:23:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on codfw racks [puppet] - 10https://gerrit.wikimedia.org/r/820066 (https://phabricator.wikimedia.org/T310070) (owner: 10Marostegui) [07:26:47] (03PS2) 10Giuseppe Lavagetto: redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 [07:28:00] (03PS1) 10Marostegui: mariadb: Disable notifications pdu C rows [puppet] - 10https://gerrit.wikimedia.org/r/820067 (https://phabricator.wikimedia.org/T310145) [07:28:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36603/console" [puppet] - 10https://gerrit.wikimedia.org/r/820065 (owner: 10Giuseppe Lavagetto) [07:29:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications pdu C rows [puppet] - 10https://gerrit.wikimedia.org/r/820067 (https://phabricator.wikimedia.org/T310145) (owner: 10Marostegui) [07:30:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] redis::multidc: actually install on mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/820065 (owner: 10Giuseppe Lavagetto) [07:33:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [07:36:32] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:37:12] ^ expected [07:41:29] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Marostegui) Databases in c1 and c2 are ready [07:41:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Marostegui) Databases in the remaining B* racks are ready [07:44:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:41] (03PS1) 10Jcrespo: [WIP]Update section script to use a stable API rather than the db [software] - 10https://gerrit.wikimedia.org/r/820069 [07:46:25] (03PS1) 10Marostegui: instances.yaml: Remove db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820070 (https://phabricator.wikimedia.org/T313911) [07:47:24] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820070 (https://phabricator.wikimedia.org/T313911) (owner: 10Marostegui) [07:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2072 from dbctl T313911', diff saved to https://phabricator.wikimedia.org/P32199 and previous config saved to /var/cache/conftool/dbconfig/20220803-074806-marostegui.json [07:48:10] T313911: decommission db2072 - https://phabricator.wikimedia.org/T313911 [07:49:21] (03PS1) 10Marostegui: mariadb: Decommission db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820071 (https://phabricator.wikimedia.org/T313911) [07:49:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2072.codfw.wmnet [07:50:45] (03PS1) 10Filippo Giunchedi: o11y: alert on Icinga max check 
latency [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) [07:51:35] (03PS1) 10Jcrespo: Add the possibility of searching racks for instances, too [software/pampinus] - 10https://gerrit.wikimedia.org/r/820073 (https://phabricator.wikimedia.org/T283017) [07:53:54] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:54:01] (03CR) 10Elukey: "Thanks Daniel! <3" [puppet] - 10https://gerrit.wikimedia.org/r/819649 (https://phabricator.wikimedia.org/T230178) (owner: 10Dzahn) [07:54:16] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:59:06] (03PS1) 10Jcrespo: [WIP]Add instance script with increased functionality over section [software] - 10https://gerrit.wikimedia.org/r/820074 (https://phabricator.wikimedia.org/T283017) [07:59:35] (03Abandoned) 10Jcrespo: [WIP]Update section script to use a stable API rather than the db [software] - 10https://gerrit.wikimedia.org/r/820069 (owner: 10Jcrespo) [08:04:04] 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Aklapper) [08:14:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2072 [puppet] - 10https://gerrit.wikimedia.org/r/820071 (https://phabricator.wikimedia.org/T313911) (owner: 10Marostegui) [08:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2072.codfw.wmnet [08:15:40] 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Marostegui) a:03Papaul [08:15:53] (03CR) 
10Elukey: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [08:15:57] 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Marostegui) @Papaul this is ready for you [08:17:15] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw [08:18:13] (03PS3) 10MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) [08:19:07] !log stop db2098 for T310070 [08:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:10] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [08:23:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32200 and previous config saved to /var/cache/conftool/dbconfig/20220803-082318-marostegui.json [08:23:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:26:14] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [08:28:10] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:35:46] (03PS1) 10KartikMistry: Update cxserver to 2022-08-03-082610-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T308248) [08:36:06] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:38:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32201 and previous config saved to /var/cache/conftool/dbconfig/20220803-083824-marostegui.json [08:40:03] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Krinkle) >>! In T138093#8117992, @ori wrote: > Rolling this out to the high-traffic wikis will be a little bit tricky. When we turn it o... [08:41:31] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:42:08] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Krinkle) >>! In T138093#8012474, @ori wrote: > Re-ordering duplicate query parameters could be problematic. […] This means that `?action... 
[08:46:04] PROBLEM - Check systemd state on db2107 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:29] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:49:02] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:51:46] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [08:53:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:24] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32202 and previous config saved to /var/cache/conftool/dbconfig/20220803-085330-marostegui.json [08:54:04] PROBLEM - Check systemd state on db2109 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:06] PROBLEM - Check systemd state on db2159 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:22] !log put the esams-drmrs link in service - T307221 [08:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:56:33] (03PS1) 10Jcrespo: Setup temporary 
arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 [08:57:27] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [08:57:55] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [08:58:18] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [08:58:41] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet [08:58:51] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [08:59:15] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet [08:59:38] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet [08:59:55] (03CR) 10CI reject: [V: 04-1] Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 (owner: 10Jcrespo) [09:00:21] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [09:00:26] RECOVERY - Check systemd state on db2159 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:33] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet [09:01:18] PROBLEM - MariaDB Replica Lag: s7 on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7790.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:01:45] (03PS1) 10Marostegui: dbproxy2002: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820082 [09:02:04] RECOVERY - Check systemd state on db2107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:42] RECOVERY - Check systemd state on db2109 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:39] (03PS2) 10Jcrespo: Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 [09:04:19] 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Aklapper) Superseded by {T310738}? [09:04:35] !log stop backup2006 backup2009 for T310070 [09:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:40] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [09:05:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [09:06:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [09:06:42] (03PS1) 10Giuseppe Lavagetto: Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 [09:06:51] (03PS2) 10Giuseppe Lavagetto: Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 [09:07:03] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "wancache: temporarily remove mc-gp2002 from the gutter pool" [puppet] - 10https://gerrit.wikimedia.org/r/819792 (owner: 10Giuseppe Lavagetto) [09:07:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet [09:08:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312972)', diff saved to https://phabricator.wikimedia.org/P32203 and previous config saved to /var/cache/conftool/dbconfig/20220803-090836-marostegui.json [09:08:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance 
[09:08:39] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:08:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [09:08:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:08:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet [09:09:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32204 and previous config saved to /var/cache/conftool/dbconfig/20220803-090912-marostegui.json [09:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32205 and previous config saved to /var/cache/conftool/dbconfig/20220803-091019-marostegui.json [09:10:56] !log configure BGP on the esams-drmrs link - T307221 [09:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:51] (03PS4) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 [09:14:41] (03CR) 10Marostegui: [C: 03+2] dbproxy2002: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820082 (owner: 10Marostegui) [09:15:23] 10SRE-swift-storage, 10Commons: New broken files (premature end of file) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T284188 (10Aklapper) 05Open→03Declined Unfortunately closing this Phabricator task as no further information has been provided. If this still happens, please... 
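The depool/repool cycle that recurs above (db1178, db1167, db1101:3318) follows one fixed shape. A minimal dry-run sketch of that sequence, assuming dbctl's `instance` and `config commit` subcommands as documented on Wikitech; nothing is executed here, the commands are only printed for review, and the instance/task names are examples taken from the 09:09 entries:

```shell
# Dry-run: print the depool -> maintain -> repool sequence seen in the log.
# Assumes dbctl's "instance"/"config commit" subcommands (per Wikitech docs);
# commands are only echoed, never executed.
INSTANCE="db1101:3318"   # example instance from the 09:09 entries
TASK="T312972"
for cmd in \
    "dbctl instance ${INSTANCE} depool" \
    "dbctl config commit -m 'Depooling ${INSTANCE} (${TASK})'" \
    "dbctl instance ${INSTANCE} pool -p 100" \
    "dbctl config commit -m 'Repooling after maintenance ${INSTANCE} (${TASK})'"
do
  echo "$cmd"
done
```

The repeated "Repooling after maintenance" commits in the log (07:07, 08:23, 08:38, 09:10, ...) reflect gradual repooling: the pool percentage is raised in steps rather than jumping straight back to 100.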
[09:15:46] !log power on mc2024 [09:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:54] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [09:17:27] (03PS3) 10Btullis: Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) [09:18:00] RECOVERY - Host mc2024 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [09:19:39] (03PS1) 10Marostegui: instances.yaml: Remove db2090 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820085 (https://phabricator.wikimedia.org/T314109) [09:20:23] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [09:20:33] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2090 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820085 (https://phabricator.wikimedia.org/T314109) (owner: 10Marostegui) [09:20:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2090 from dbctl T314109', diff saved to https://phabricator.wikimedia.org/P32206 and previous config saved to /var/cache/conftool/dbconfig/20220803-092053-marostegui.json [09:20:56] T314109: decommission db2090 - https://phabricator.wikimedia.org/T314109 [09:21:36] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [09:21:41] (03PS1) 10Vgutierrez: lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) [09:22:28] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:22:28] !log oblivian@cumin1001 END 
(FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [09:23:17] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:23:19] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:05] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:24:05] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:13] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route [09:24:15] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:24:32] RECOVERY - MariaDB Replica Lag: s7 on db2159 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:25:20] (03PS1) 10Marostegui: site.pp: Decommission db2090 [puppet] - 10https://gerrit.wikimedia.org/r/820087 (https://phabricator.wikimedia.org/T314109) [09:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32207 and previous config saved to /var/cache/conftool/dbconfig/20220803-092525-marostegui.json [09:25:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2090.codfw.wmnet [09:28:03] (03PS1) 10Ayounsi: add include for 2620:0:862:fe08::/64 PTR [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) [09:29:17] (03CR) 10CI reject: [V: 04-1] add include for 2620:0:862:fe08::/64 PTR [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [09:29:50] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [09:31:13] (03CR) 10Ayounsi: "The -1 is expected as the included file is created by the `sre.netbox.dns` cookbook." 
[dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[09:31:25] (03PS1) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129)
[09:31:34] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1042_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[09:32:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:32:37] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[09:33:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[09:33:52] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@674bb8b]: (no justification provided)
[09:33:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:33:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2090.codfw.wmnet
[09:33:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) (owner: 10Vgutierrez)
[09:34:01] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@674bb8b]: (no justification provided) (duration: 00m 10s)
[09:34:58] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:35:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS bullseye
[09:35:07] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye
[09:35:19] (03CR) 10Btullis: [C: 03+2] Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32208 and previous config saved to /var/cache/conftool/dbconfig/20220803-094032-marostegui.json
[09:41:50] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:42:07] !log kubectl cordon kubestage2002
[09:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:25] (03CR) 10Vgutierrez: [C: 03+2] lvs: Use conf2005 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/820086 (https://phabricator.wikimedia.org/T310070) (owner: 10Vgutierrez)
[09:43:47] !log rolling restart of pybal in codfw lvs instances - T310070
[09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:50] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[09:43:50] !log kubectl drain --ignore-daemonsets kubestage2002.codfw.wmnet
[09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:16] (03PS13) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[09:44:36] (03PS1) 10Giuseppe Lavagetto: Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092
[09:44:52] (03CR) 10Marostegui: [C: 03+2] site.pp: Decommission db2090 [puppet] - 10https://gerrit.wikimedia.org/r/820087 (https://phabricator.wikimedia.org/T314109) (owner: 10Marostegui)
[09:45:43] (03CR) 10CI reject: [V: 04-1] Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092 (owner: 10Giuseppe Lavagetto)
[09:46:06] !log kubectl cordon kubernetes2020.codfw.wmnet kubernetes2009.codfw.wmnet kubernetes2010.codfw.wmnet kubernetes2011.codfw.wmnet kubernetes2012.codfw.wmnet
[09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:08] 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Marostegui) @papaul this is ready for you
[09:46:12] 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Marostegui) a:03Papaul
[09:47:49] !log kubectl drain --ignore-daemonsets kubernetes2020.codfw.wmnet
[09:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:38] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[09:48:46] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
[09:49:19] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
[09:49:37] (03CR) 10Btullis: "I have made a related wikitech edit about this change:" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[09:50:26] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2009.codfw.wmnet
[09:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:55] (03CR) 10Jbond: [C: 03+2] reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 (owner: 10Jbond)
[09:52:52] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2010.codfw.wmnet
[09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
[09:54:48] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route
[09:54:48] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[09:54:49] !Log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
[09:54:56] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
[09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:32] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=restbase2027.codfw.wmnet
[09:55:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32209 and previous config saved to /var/cache/conftool/dbconfig/20220803-095538-marostegui.json
[09:55:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:55:41] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[09:55:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[09:56:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32210 and previous config saved to /var/cache/conftool/dbconfig/20220803-095559-marostegui.json
[09:56:26] !log kubectl drain --ignore-daemonsets --delete-local-data kubernetes2012.codfw.wmnet
[09:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:41] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet
[09:56:42] (03CR) 10MMandere: [C: 03+1] "LGTM... The new DC order is correct as per measurements recorded in the aforementioned excel sheet." [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall)
[09:56:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
[09:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32211 and previous config saved to /var/cache/conftool/dbconfig/20220803-095706-marostegui.json
[09:58:00] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10fgiunchedi)
[09:58:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation={create,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:00:33] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi)
[10:01:14] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:04:30] 10SRE-swift-storage, 10User-fgiunchedi: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (10fgiunchedi)
[10:04:52] (03CR) 10Jbond: "look good but a few nits" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[10:12:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32212 and previous config saved to /var/cache/conftool/dbconfig/20220803-101212-marostegui.json
[10:14:28] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:14:31] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:14:35] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:14:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS bullseye
[10:14:39] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:14:41] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye completed: - ganeti2011 (**PASS**) - Downtimed on...
[10:20:19] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubestage2002.codfw.wmnet
[10:21:35] (03CR) 10Jcrespo: [C: 03+2] Setup temporary arrangement for codfw snapshots durin PDU maintenance [puppet] - 10https://gerrit.wikimedia.org/r/820081 (owner: 10Jcrespo)
[10:22:19] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2020.codfw.wmnet
[10:22:32] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2009.codfw.wmnet
[10:22:43] (03PS1) 10Jcrespo: Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794
[10:22:48] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet
[10:23:00] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2011.codfw.wmnet
[10:23:09] (03CR) 10Jcrespo: [C: 04-1] "Wait for PDU maintenance to complete to revert." [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[10:23:15] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2012.codfw.wmnet
[10:26:36] (03CR) 10Filippo Giunchedi: [C: 03+1] Add VictorOps CLI tool & escalate_unpaged command (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[10:27:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32213 and previous config saved to /var/cache/conftool/dbconfig/20220803-102718-marostegui.json
[10:27:48] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:27:51] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:27:56] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:27:59] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:28:12] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:29:52] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
[10:29:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
[10:30:00] !log oblivian@cumin1001 START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
[10:30:03] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
[10:30:41] (03PS1) 10Marostegui: mariadb: Productionize db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820098 (https://phabricator.wikimedia.org/T311494)
[10:32:37] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:33:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820098 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[10:37:38] !log shutdown kubestage2002 kubernetes2020 kubernetes2009 kubernetes2010 kubernetes2011 kubernetes2012
[10:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
[10:38:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
[10:38:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2022.codfw.wmnet
[10:40:23] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase201[45].codfw.wmnet
[10:41:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[10:42:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312972)', diff saved to https://phabricator.wikimedia.org/P32215 and previous config saved to /var/cache/conftool/dbconfig/20220803-104224-marostegui.json
[10:42:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[10:42:28] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[10:42:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[10:42:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32216 and previous config saved to /var/cache/conftool/dbconfig/20220803-104246-marostegui.json
[10:44:19] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[10:44:58] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:45:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:46:47] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
[10:47:01] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
[10:47:41] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[10:48:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:48:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:58] (KubernetesCalicoDown) firing: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:50:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[10:53:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2011.codfw.wmnet to cluster codfw and group C
[10:53:59] (03PS1) 10Filippo Giunchedi: WIP klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603)
[10:54:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2011.codfw.wmnet to cluster codfw and group C
[10:54:36] (03CR) 10Filippo Giunchedi: "An outline to run escalate_unpaged periodically, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi)
[10:54:41] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:58] (KubernetesCalicoDown) firing: (3) kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:56:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "This can probably be abandoned per ori in T293614#8126628 (leaving it open for additional feedback for a few days)." [puppet] - 10https://gerrit.wikimedia.org/r/819016 (https://phabricator.wikimedia.org/T293614) (owner: 10Lucas Werkmeister (WMDE))
[10:56:36] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (10Lucas_Werkmeister_WMDE) That’s great, thanks! In that case I’m happy to close this task and live with my...
[10:59:44] (03PS1) 10Marostegui: install_server: Do not reimage db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820101 (https://phabricator.wikimedia.org/T311494)
[11:00:46] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820101 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[11:02:21] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi)
[11:04:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi)
[11:08:26] (03PS1) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:09:47] (03CR) 10CI reject: [V: 04-1] site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:09:58] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:12:32] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[11:12:52] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:13:12] (03CR) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis)
[11:13:14] (03PS2) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:13:21] 10SRE, 10Observability-Logging, 10vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (10MoritzMuehlenhoff) >>! In T313408#8124432, @MoritzMuehlenhoff wrote: > Looks good, could you use Row C but wait two days? I'm currently reimaging codfw ganeti node...
[11:13:27] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[11:14:53] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[11:14:55] (03CR) 10CDanis: [C: 03+1] WIP klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi)
[11:15:35] (03PS1) 10Jbond: peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[11:16:44] (03CR) 10Jbond: PeeringDB API: initial commit (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:17:41] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:41] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:43] <_joe_> !log depooling codfw services from all traffic
[11:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:57] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:17:59] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:17] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:35] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:18:45] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[11:18:46] (03CR) 10Jbond: PeeringDB API: initial commit (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:18:47] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:21:03] (03PS1) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651)
[11:21:32] (03PS2) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651)
[11:21:58] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async
[11:22:10] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-backend
[11:22:47] (03CR) 10CI reject: [V: 04-1] peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103 (owner: 10Jbond)
[11:22:54] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=kartotherian
[11:23:26] (03CR) 10Marostegui: "It looks good, let's run a PCC just in case." [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:24:01] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[11:24:22] 10SRE: Allow Wikimedia Maps usage on a private project for an university. - https://phabricator.wikimedia.org/T306467 (10Aklapper) 05Stalled→03Declined Unfortunately closing this Phabricator task as no further information has been provided. @HermidaVazquez: After you have provided the information asked for...
[11:24:23] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[11:24:35] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[11:25:47] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:25:47] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:25:47] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:26:09] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wdqs
[11:26:26] (03CR) 10Jbond: Add an option to use the PKI for etcd intra-cluster certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:29:20] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10hnowlan)
[11:31:28] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:31:28] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:31:28] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:32:19] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[11:35:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:37:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
[11:38:10] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[11:38:10] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase2022.codfw.wmnet
[11:41:29] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=(kubernetes2020.codfw.wmnet|kubernetes2009.codfw.wmnet|kubernetes2010.codfw.wmnet|kubernetes2011.codfw.wmnet|kubernetes2012.codfw.wmnet|kubestage2002.codfw.wmnet)
[11:42:17] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi)
[11:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32217 and previous config saved to /var/cache/conftool/dbconfig/20220803-114301-marostegui.json
[11:43:04] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[11:45:14] (03PS3) 10Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102
[11:45:19] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup)
[11:46:36] !log jayme@cumin1001 conftool action : set/weight=10; selector: name=(kubernetes2019.codfw.wmnet|kubernetes2021.codfw.wmnet|kubernetes2022.codfw.wmnet|kubernetes2018.codfw.wmnet|kubernetes2020.codfw.wmnet)
[11:47:35] (03PS14) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[11:49:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:16] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:24] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:26] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:30] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145
[11:49:34] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[11:49:35] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:36] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:38] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:38] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:42] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:49:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145
[11:50:28] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:50:30] (03PS15) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701
[11:50:48] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:51:40] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:54:16] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:54:37] (03PS1) 10Marostegui: instances.yaml: Add db2176 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820110 (https://phabricator.wikimedia.org/T311494)
[11:54:50] (03PS2) 10Giuseppe Lavagetto: Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092
[11:55:55] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2176 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820110 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[11:56:18] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:56:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Switch read traffic for etcd to eqiad [dns] - 10https://gerrit.wikimedia.org/r/820092 (owner: 10Giuseppe Lavagetto)
[11:57:01] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[11:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2176 to s1 T311494', diff saved to https://phabricator.wikimedia.org/P32218 and previous config saved to /var/cache/conftool/dbconfig/20220803-115706-marostegui.json
[11:57:10] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[11:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32219 and previous config saved to /var/cache/conftool/dbconfig/20220803-115807-marostegui.json
[11:59:07] (03PS1) 10Btullis: cfssl [puppet] - 10https://gerrit.wikimedia.org/r/820111
[11:59:40] (03PS1) 10Marostegui: db2176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820112 (https://phabricator.wikimedia.org/T311494)
[12:00:30] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:30] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[12:00:38] (03CR) 10Marostegui: [C: 03+2] db2176: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820112 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[12:01:33] (03Abandoned) 10Btullis: cfssl [puppet] - 10https://gerrit.wikimedia.org/r/820111 (owner: 10Btullis)
[12:02:28] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10JMeybohm)
[12:03:05] (03PS1) 10Jbond: puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136)
[12:04:12] (03PS2) 10Jbond: puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136)
[12:04:56] (03CR) 10Btullis: "My latest patchset didn't get updated properly. Will fix shortly." [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[12:05:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36606/console" [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond)
[12:07:18] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: regen-zoom-level-tilerator-regen.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:03] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10JMeybohm)
[12:12:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32220 and previous config saved to /var/cache/conftool/dbconfig/20220803-121313-marostegui.json [12:14:45] 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) [12:16:34] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@614f7b2]: (no justification provided) [12:16:45] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@614f7b2]: (no justification provided) (duration: 00m 11s) [12:17:01] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:19:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10EChetty) [12:19:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10EChetty) [12:19:33] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:19:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10EChetty) [12:21:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:21:52] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:23:01] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10fgiunchedi) [12:23:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10jnuche) [12:24:25] jayme: sorry we had conflicting edits on T310145 :( [12:24:26] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [12:24:50] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10MoritzMuehlenhoff) [12:25:44] godog: but I was first :D [12:26:27] jayme: haha! were you though? I see your edit obliterated some of my edits too [12:26:38] anyways I'll fix it [12:26:46] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:26:56] ah you did, nicely done jayme [12:26:59] godog: oops, okay [12:27:08] hope your stuff is still in there now godog [12:27:17] it is! 
👍 [12:27:25] (03PS1) 10Giuseppe Lavagetto: Reducing codfw mobileapps during maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 [12:27:33] cool [12:27:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10JMeybohm) [12:27:57] just another edit to be sure it is persistent :-p [12:28:11] lolz [12:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312972)', diff saved to https://phabricator.wikimedia.org/P32221 and previous config saved to /var/cache/conftool/dbconfig/20220803-122819-marostegui.json [12:28:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:28:25] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:28:28] (03PS1) 10Hnowlan: jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) [12:28:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:28:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:29:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:29:24] (03CR) 10JMeybohm: [C: 03+1] Reducing codfw mobileapps during maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 (owner: 10Giuseppe Lavagetto) [12:29:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32222 and previous config saved to /var/cache/conftool/dbconfig/20220803-122929-marostegui.json [12:30:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Reducing codfw mobileapps during 
maintenance [deployment-charts] - 10https://gerrit.wikimedia.org/r/820116 (owner: 10Giuseppe Lavagetto) [12:31:19] (03CR) 10Ladsgroup: "Ran it on anything that had db in it:" [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup) [12:31:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi) [12:33:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32223 and previous config saved to /var/cache/conftool/dbconfig/20220803-123336-marostegui.json [12:33:40] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:35:03] (03PS3) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) [12:36:36] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:36:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10fgiunchedi) [12:36:43] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:39:33] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) [12:40:03] !log uploaded openjdk-8 8u342-b07-1~deb10u1 to component/jdk8 for buster-wikimedia (rebuild of latest Java 8 security update) [12:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) [12:41:15] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) [12:48:28] PROBLEM - etcd request latencies on 
kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:48:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32224 and previous config saved to /var/cache/conftool/dbconfig/20220803-124842-marostegui.json [12:48:58] (03PS2) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) [12:49:50] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:51:44] (03Abandoned) 10Ssingh: hiera: replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819706 (https://phabricator.wikimedia.org/T293012) (owner: 10Ssingh) [12:53:24] (03PS1) 10Ssingh: Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 [12:56:29] (03PS1) 10Marostegui: mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) [12:56:52] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [12:58:20] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [12:59:10] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: T310070 [12:59:13] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [12:59:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
elastic2043.codfw.wmnet with reason: T310070 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1300). [13:00:05] hauskatze: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] o/ [13:00:41] (03PS1) 10Vgutierrez: Backport several fixex scheduled for 9.1.3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) [13:01:01] (03PS2) 10Vgutierrez: Backport several fixes scheduled for 9.1.3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) [13:01:03] I can deploy today [13:01:05] hi hauskatze [13:01:10] (03CR) 10Urbanecm: [C: 03+2] Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio) [13:03:09] (03PS1) 10Btullis: Add DHCP details for an-airflow1004 [puppet] - 10https://gerrit.wikimedia.org/r/820122 (https://phabricator.wikimedia.org/T312858) [13:03:45] (03Merged) 10jenkins-bot: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio) [13:03:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32226 and previous config saved to /var/cache/conftool/dbconfig/20220803-130348-marostegui.json [13:04:01] o/ [13:04:22] hauskatze: pulled to mwdebug1001, please have a look [13:04:26] ack [13:04:33] (03CR) 10Btullis: [C: 03+2] Add DHCP details for an-airflow1004 [puppet] - 10https://gerrit.wikimedia.org/r/820122 
(https://phabricator.wikimedia.org/T312858) (owner: 10Btullis) [13:04:50] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:04] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070 [13:05:08] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [13:05:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070 [13:05:56] (03PS2) 10Marostegui: mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) [13:06:31] urbanecm: checked - form appears okay with the fields as requested [13:06:38] great, syncing [13:06:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:03] I'll ask tzatziki to send a test email once synced just in case, so they can verify everything works [13:07:10] sounds good [13:07:15] (03PS1) 10Papaul: Add new pdu model for pdu in rack b3 b6-b8 and c1 [puppet] - 10https://gerrit.wikimedia.org/r/820123 (https://phabricator.wikimedia.org/T310070) [13:07:26] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:07:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2177 [puppet] - 10https://gerrit.wikimedia.org/r/820120 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [13:07:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:07:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:07:57] (03PS2) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 
(https://phabricator.wikimedia.org/T313603) [13:08:14] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [13:08:31] (03CR) 10Ayounsi: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [13:09:32] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu [13:09:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu [13:10:42] scap says `ssh: connect to host mw2259.codfw.wmnet port 22: Connection timed out` [13:11:14] urbanecm: very likely expected, a bunch of codfw machines are being powered off for maintenance on the power distribution equipment [13:11:29] yep, just mentioning as a good practice [13:11:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:36] 👍 [13:11:44] (03CR) 10Ssingh: [C: 03+1] "LGTM based on my (limited) understanding of this; but I did compare it against similar code and looks good." [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:12:23] I guess the codfw machines can be scap pulled afterwards? [13:12:33] (or perhaps that’s even part of the power-on procedure? idk) [13:12:37] (03PS1) 10Marostegui: site.pp: Remove insetup role from db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820125 (https://phabricator.wikimedia.org/T311494) [13:12:41] !log introduce puppetmaster[12]004 for now as offline [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:54] i find it interesting that mw2259 is apparently still a part of the jobrunner cluster? 
https://config-master.wikimedia.org/pybal/codfw/jobrunner [13:13:37] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2176 [puppet] - 10https://gerrit.wikimedia.org/r/820125 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [13:13:49] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36607/console" [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:13:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster: add new offline puppetmaster[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/820113 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [13:13:56] and `13:12:29 39 apaches had sync errors` [13:14:20] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:14:28] Lucas_WMDE: they are set pooled=inactive in pybal so they won't begin serving traffic automatically upon bootup, and yeah, serviceops will do `scap pull` afterwards, that is standard procedure for returning a machine to service [13:14:35] (03PS4) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [13:14:37] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 (owner: 10Ssingh) [13:14:39] cdanis: good to know, thanks! 
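[Editor's note] The pool-state filtering described above — hosts set pooled=inactive in conftool are skipped by deploy tooling and repooled with `scap pull` later — can be sketched as follows. The record format and the eqiad hostname are purely illustrative, not the real conftool schema:

```python
# Minimal sketch of filtering sync targets by pool state, assuming a
# hypothetical record format (NOT the real conftool/etcd schema): hosts
# marked pooled=inactive should be skipped by deploy tooling such as scap.

def sync_targets(records):
    """Hostnames that should still receive syncs: anything not 'inactive'."""
    return [r["name"] for r in records if r["pooled"] != "inactive"]

hosts = [
    {"name": "mw2259.codfw.wmnet", "pooled": "inactive"},  # powered off for PDU work
    {"name": "mw1414.eqiad.wmnet", "pooled": "yes"},       # placeholder hostname
]
print(sync_targets(hosts))  # the inactive codfw host is dropped
```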
[13:15:19] (03PS2) 10Ssingh: Revert "Revert "Depool codfw for PDU upgrade"" [dns] - 10https://gerrit.wikimedia.org/r/819798 [13:16:27] !log urbanecm@deploy1002 Synchronized wmf-config/MetaContactPages.php: f89f02e306a1fa580fa41ba56de978f4208ea672: Amend license request contact form per Legal (T303359) (duration: 09m 27s) [13:16:31] T303359: Remove items from Meta-Wiki page [[Special:Contact/requestlicense]] - https://phabricator.wikimedia.org/T303359 [13:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:57] \o/ [13:16:58] cdanis: if i'm not misunderstanding https://config-master.wikimedia.org/pybal/codfw/jobrunner, mw2259 is still pooled (and unreachable). it's definitely not pooled=inactive, because scap should ignore inactive hosts. [13:17:13] oh, the jobrunners, hm [13:17:18] _joe_: ^ [13:18:03] !log depool codfw for PDU upgrade: CR 819798 [13:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312972)', diff saved to https://phabricator.wikimedia.org/P32227 and previous config saved to /var/cache/conftool/dbconfig/20220803-131855-marostegui.json [13:18:56] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7419 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:18:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:19:00] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:19:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:19:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T312972)', diff saved to 
https://phabricator.wikimedia.org/P32228 and previous config saved to /var/cache/conftool/dbconfig/20220803-131916-marostegui.json [13:19:26] urbanecm: well, I can at least verify that mw2259 is in an affected rack (B3) [13:19:37] (03PS1) 10Btullis: Configure an-airflow1004 to install with buster [puppet] - 10https://gerrit.wikimedia.org/r/820126 (https://phabricator.wikimedia.org/T312858) [13:20:52] thanks cdanis [13:21:10] fwiw there is also an apiserver (mw2317) that's also pooled & unreachable. didn't check the rest of the hosts though. [13:21:14] (03PS4) 10Vgutierrez: trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) [13:21:18] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:21:45] ok, that's also in the affected set of hosts [13:21:50] I'll make sure those get marked as depooled soon [13:21:54] thanks! 
[13:22:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36608/console" [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:22:32] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [13:23:21] urbanecm: actually j.ayme from serviceops will check that rn :) thanks for noting [13:23:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:23] (03PS16) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [13:24:53] (03CR) 10Filippo Giunchedi: [C: 03+1] Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:25:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32229 and previous config saved to /var/cache/conftool/dbconfig/20220803-132524-marostegui.json [13:25:27] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:25:40] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7195 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:26:42] (03CR) 10Ssingh: [C: 03+1] "The new addition looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:27:31] (03CR) 10Btullis: [C: 03+2] Configure an-airflow1004 to install with buster [puppet] - 10https://gerrit.wikimedia.org/r/820126 (https://phabricator.wikimedia.org/T312858) (owner: 10Btullis) [13:28:39] (03PS3) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) [13:28:49] (03CR) 10CDanis: [C: 03+2] Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:30:43] !log installing Java 8 security updates for Buster [13:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:04] (03Merged) 10jenkins-bot: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) (owner: 10CDanis) [13:31:50] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:32:05] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Track time spent in or waiting for plugins [puppet] - 10https://gerrit.wikimedia.org/r/820104 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:32:18] (03PS1) 10Jbond: readme: add w-sre to channels [puppet] - 10https://gerrit.wikimedia.org/r/820128 [13:32:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] readme: add w-sre to channels [puppet] - 10https://gerrit.wikimedia.org/r/820128 (owner: 10Jbond) [13:35:50] (03PS1) 10Muehlenhoff: Add Cumin aliases for ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/820129 [13:37:16] (03PS7) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) [13:37:53] cdanis: 
urbanecm: looks like the data on config-master is "old" confctl shows the nodes as pooled=inactive [13:38:01] ahhhhh [13:38:08] enlighten me :-) [13:38:11] hm. [13:38:15] interesting [13:38:31] scap trying to talk to them still is really interesting [13:38:37] thanks for checking jayme. then scap probably also uses the old confctl state? [13:39:10] is the thing that generates dsh files for scap -- and config-master state -- perhaps *not* redirected to read from eqiad instead of codfw? [13:39:54] (03PS2) 10Filippo Giunchedi: klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) [13:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32230 and previous config saved to /var/cache/conftool/dbconfig/20220803-134030-marostegui.json [13:41:53] (03CR) 10Filippo Giunchedi: [C: 03+2] klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:42:25] this is a part of the infra I never got too deep into sadly [13:42:26] hmm...but etcd should be replicated anyways.
Just the clients not talking to codfw etcd AIUI [13:43:22] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] klaxon: run VO escalation of unpaged incidents [puppet] - 10https://gerrit.wikimedia.org/r/820100 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:43:56] ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕤☕ grep mw2259 /etc/dsh/group/* [13:43:58] /etc/dsh/group/jobrunner:mw2259.codfw.wmnet [13:43:59] (03PS1) 10Vgutierrez: trafficserver: Adjust (Total|Active)PluginTime milestones [puppet] - 10https://gerrit.wikimedia.org/r/820132 (https://phabricator.wikimedia.org/T309651) [13:44:00] /etc/dsh/group/mediawiki-installation:mw2259.codfw.wmnet [13:44:03] /etc/dsh/group/scap-proxies:mw2259.codfw.wmnet [13:44:04] /etc/dsh/group/scap_targets:mw2259.codfw.wmnet [13:44:38] confd is producing a bunch of errors on deploy1002 [13:44:39] (03CR) 10EllenR: [C: 03+1] "LGTM, but I would never have guessed that you would put click event in InitializeSettings-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) (owner: 10Eigyan) [13:44:41] Aug 3 13:43:57 deploy1002 confd[32274]: 2022-08-03T13:43:57Z deploy1002 /usr/bin/confd[32274]: ERROR client: etcd cluster is unavailable or misconfigured; error #0: dial tcp: lookup conf1004.eqiad.wmnet on 10.3.0.1:53: no such host [13:44:44] Aug 3 13:43:57 deploy1002 confd[32274]: ; error #1: dial tcp: lookup conf1006.eqiad.wmnet on 10.3.0.1:53: no such host [13:44:45] Aug 3 13:43:57 deploy1002 confd[32274]: ; error #2: dial tcp: lookup conf1005.eqiad.wmnet on 10.3.0.1:53: no such host [13:45:57] hmm conf1005 and conf1006 have been decomm'ed per https://phabricator.wikimedia.org/T311408 [13:46:29] new ones are conf100[789] [13:46:41] maybe confd needs a restart to pick up changes to the srv record? 
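[Editor's note] The suspected failure mode here — a long-running watcher resolved its etcd endpoints from the SRV record once at startup and kept dialing the decommissioned conf100[456] hosts — can be sketched like this. The dict stands in for DNS; this is an illustration of the caching behaviour, not confd's actual implementation:

```python
# Sketch of the stale-backend failure mode: a long-running watcher resolves
# its etcd endpoints from a DNS SRV record once at startup and keeps dialing
# them until restarted. The dict below stands in for real DNS.

dns_srv = {"_etcd._tcp.eqiad.wmnet": ["conf1004", "conf1005", "conf1006"]}

class Watcher:
    def __init__(self, srv_name):
        self.srv_name = srv_name
        # Resolved once at startup, then cached for the life of the process.
        self.backends = list(dns_srv[srv_name])

    def restart(self):
        # Only a restart re-resolves the SRV record.
        self.backends = list(dns_srv[self.srv_name])

w = Watcher("_etcd._tcp.eqiad.wmnet")
# conf1004-6 are decommissioned and replaced by conf1007-9:
dns_srv["_etcd._tcp.eqiad.wmnet"] = ["conf1007", "conf1008", "conf1009"]
print(w.backends)  # still the old, now-nonexistent hosts
w.restart()
print(w.backends)  # picks up the new hosts
```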
[13:46:43] yeah and the SRV record looks correct [13:46:48] that would be really asinine but I'm going to do it [13:46:58] !log ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ sudo systemctl restart confd [13:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:20] heh [13:47:26] on restart it did update everything [13:47:28] that worked? [13:47:30] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Adjust (Total|Active)PluginTime milestones [puppet] - 10https://gerrit.wikimedia.org/r/820132 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:47:33] ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ grep mw2259 /etc/dsh/group/* [13:47:35] /etc/dsh/group/scap-proxies:mw2259.codfw.wmnet [13:47:38] /etc/dsh/group/scap_targets:mw2259.codfw.wmnet [13:47:39] no longer in jobrunners [13:48:15] jayme: so it looks like we need to do a global restart of confd too [13:48:33] old confctl data in non obvious was is not the first time it hit us- I think something weird happened on a dc switchover too [13:48:41] *non-obvious ways [13:48:48] and probably uh [13:48:58] figure out how to monitor for confd being stuck in an unhealthy state for *days* [13:49:05] +1 [13:49:15] both on etcd and on client side [13:50:17] (03PS1) 10Filippo Giunchedi: klaxon: fix escalate_unpaged usage [puppet] - 10https://gerrit.wikimedia.org/r/820135 (https://phabricator.wikimedia.org/T313603) [13:50:33] cdanis: ^ [13:50:53] (03CR) 10CDanis: [C: 03+2] klaxon: fix escalate_unpaged usage [puppet] - 10https://gerrit.wikimedia.org/r/820135 (https://phabricator.wikimedia.org/T313603) (owner: 10Filippo Giunchedi) [13:53:49] does anyone object to me restarting confd across the fleet? 
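[Editor's note] A simple client-side check of the kind proposed above — flag confd as stuck when a file it manages has not been rewritten for days — could look like the sketch below. The path and threshold are illustrative; this is not an existing WMF check:

```python
# Sketch of a staleness probe for confd-managed files: if the rendered
# file's mtime is older than a threshold, the watcher is likely stuck.
import os
import time

def is_stale(path, max_age_seconds):
    """True if `path` was last rewritten more than `max_age_seconds` ago."""
    return time.time() - os.path.getmtime(path) > max_age_seconds

# Hypothetical usage: alert if the jobrunner dsh group is 3+ days old.
# is_stale("/etc/dsh/group/jobrunner", 3 * 86400)
```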
[13:54:07] RECOVERY - puppetmaster backend https on puppetmaster1004 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.447 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:54:23] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:54:46] cdanis: thanks for looking & taking care - I have none [13:54:50] ^ cdanis let's wait to see what that is [13:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32231 and previous config saved to /var/cache/conftool/dbconfig/20220803-135536-marostegui.json [13:56:22] I see an increase in latency, but not a recent one: https://grafana.wikimedia.org/goto/BH-MUfk4z?orgId=1 [13:56:59] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [13:57:13] heh [13:57:21] I'm going to proceed [13:57:39] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'P{R:Class = Confd}' 'systemctl restart confd' [13:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] done [13:58:49] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11345 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:00:13] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [14:01:31] (03PS1) 10Cwhite: scap: add option to selectivlely disable bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) [14:02:24] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [14:02:43] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:04:02] (03PS2) 10Cwhite: scap: add option to selectivlely disable bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) [14:04:13] (03PS1) 10Ladsgroup: auto_schema: Start depooling codfw replicas [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) [14:04:19] ah, I just noticed it is codfw only [14:04:22] sorry [14:06:34] !log installing freetype security updates on bullseye [14:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312972)', diff saved to https://phabricator.wikimedia.org/P32232 and previous config saved to /var/cache/conftool/dbconfig/20220803-141042-marostegui.json [14:10:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1109.eqiad.wmnet with reason: Maintenance [14:10:46] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:10:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [14:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32233 and previous config saved to /var/cache/conftool/dbconfig/20220803-141103-marostegui.json [14:11:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:12:29] (03PS1) 10Jbond: C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) [14:12:31] (03CR) 10Marostegui: [C: 03+1] site.pp: Combine mariadb replicas and master in each section [puppet] - 10https://gerrit.wikimedia.org/r/820102 (owner: 10Ladsgroup) [14:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32234 and previous config saved to /var/cache/conftool/dbconfig/20220803-141310-marostegui.json [14:13:37] (03CR) 10CI reject: [V: 04-1] C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:14:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36609/console" [puppet] - 
10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:16:03] (03PS1) 10CDanis: Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) [14:16:51] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:18:11] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [14:18:33] urbanecm: again thanks very much for pointing that out earlier, I filed our own task & a bug upstream against confd [14:19:10] cdanis: thanks for figuring it out so quickly! [14:19:27] (03PS2) 10Jbond: C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) [14:19:32] apparently this is a known issue but not a well-documented or well-known-enough issue :? [14:19:35] :/ [14:20:41] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:20:50] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael) Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata? Though I'm not very familiar with this particular part... 
[14:20:52] (03CR) 10CDanis: [C: 03+2] Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:22:50] (03CR) 10Jbond: "couple more minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:25:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:27:14] !log upgrading ganeti/esams to Ganeti 3.0.2 T312637 [14:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [14:28:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32235 and previous config saved to /var/cache/conftool/dbconfig/20220803-142816-marostegui.json [14:28:21] !log power off thumbor2003 and thumbor2004 [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:29] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:28:36] (03CR) 10Jbond: Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:29:51] PROBLEM - Host thumbor2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:57] PROBLEM - Host conf2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:06] this is fine [14:30:44] (03CR) 10CDanis: [C: 03+2] Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [14:31:08] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap [14:31:21] 
!log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work [14:31:22] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap [14:31:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work [14:31:45] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=664eda2d-5203-44ca-92c1-3213c3996b5f) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and t... [14:31:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36610/console" [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:32:08] !log shutdown aqs200[5-8] prior to PDU work T310070 [14:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:11] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [14:33:39] (03CR) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:33:51] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070 [14:34:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070 [14:34:29] PROBLEM - Host mw2314.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2315.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host 
mw2317.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2316.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:29] PROBLEM - Host mw2318.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:30] PROBLEM - Host mw2319.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:30] PROBLEM - Host mw2321.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:31] PROBLEM - Host mw2320.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:34:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetmaster::ssl: ensure we generate the ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/820144 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [14:35:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:35:47] PROBLEM - Host mw2324.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:36:27] PROBLEM - Host mw2311.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:11] PROBLEM - Host mw2322.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:12] PROBLEM - Host mw2323.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:03] (03CR) 10Jbond: Add an option to use the PKI for etcd intra-cluster certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:38:14] (03CR) 10Jbond: [C: 03+1] Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:39:24] (03PS3) 10Btullis: Add an option to use the PKI for etcd intra-cluster certificates [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) [14:39:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis) [14:39:59] PROBLEM - 
Host mw2310.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:59] PROBLEM - Host mw2312.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:40:02] PROBLEM - Host mw2313.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:15] PROBLEM - Host ores2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:27] PROBLEM - Host restbase2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10fgiunchedi) a:05herron→03RobH Hi @robh, re: racking since this is an expansion please allocate to new rows (compared to existing kafka-logging... [14:42:57] PROBLEM - Host mw2259.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2261.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2262.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2260.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:57] PROBLEM - Host mw2264.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:58] PROBLEM - Host mw2263.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:58] PROBLEM - Host mw2266.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:59] PROBLEM - Host mw2265.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:42:59] PROBLEM - Host mw2268.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:00] PROBLEM - Host mw2267.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:00] PROBLEM - Host mw2269.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:07] PROBLEM - Host db2123.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:18] do we not downtime mgmt? 
heh [14:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32236 and previous config saved to /var/cache/conftool/dbconfig/20220803-144322-marostegui.json [14:44:31] PROBLEM - Host db2108.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:42] PROBLEM - Host es2021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:09] RECOVERY - puppetmaster backend https on puppetmaster2004 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.522 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:45:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:45:33] PROBLEM - Host conf2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:33] PROBLEM - Host mw2270.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:46:00] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:02] (03PS1) 10Jbond: puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 [14:46:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:46:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:46:14] (03CR) 10Jbond: [C: 03+2] puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 (owner: 10Jbond) [14:46:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet::ssl: update private permisions not public [puppet] - 10https://gerrit.wikimedia.org/r/820147 (owner: 10Jbond) [14:46:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:47:21] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL 
- Packet loss = 100% [14:47:29] PROBLEM - Host ores2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:31] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.21 (duration: 06m 13s) [14:48:45] PROBLEM - Host thumbor2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:13] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:49:21] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:50:02] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:37] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7169 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:51:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:52:37] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:37] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:51] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:53:25] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.19 (duration: 05m 37s) [14:53:59] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:09] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:55:53] (03CR) 10Jbond: add include for 2620:0:862:fe08::/64 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: 10Ayounsi) [14:55:57] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10lmata) p:05Triage→03Medium [14:56:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:56:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:56:09] RECOVERY - Check systemd state on search-loader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:15] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:56:43] RECOVERY - restbase endpoints health on restbase2023 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:58:05] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) p:05Medium→03High [14:58:13] 10SRE-OnFire, 10observability, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) 05Open→03Resolved [14:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312972)', diff saved to https://phabricator.wikimedia.org/P32237 and previous config saved to /var/cache/conftool/dbconfig/20220803-145828-marostegui.json [14:58:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:58:32] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:58:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:58:49] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32238 and previous config saved to /var/cache/conftool/dbconfig/20220803-145849-marostegui.json [14:59:50] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap [14:59:52] !log jayme@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap [14:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32239 and previous config saved to /var/cache/conftool/dbconfig/20220803-145956-marostegui.json [15:00:38] (03CR) 10Cwhite: "This is only one option. I've also seen it "disabled" by setting scap::deployment_server to $fqdn." [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:01:19] RECOVERY - Host thumbor2003.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 501.02 ms [15:01:21] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:01:21] RECOVERY - Host mw2259.mgmt is UP: PING OK - Packet loss = 0%, RTA = 60.35 ms [15:01:21] RECOVERY - Host mw2260.mgmt is UP: PING OK - Packet loss = 0%, RTA = 55.06 ms [15:01:21] RECOVERY - Host mw2261.mgmt is UP: PING OK - Packet loss = 0%, RTA = 55.12 ms [15:01:22] RECOVERY - Host mw2262.mgmt is UP: PING OK - Packet loss = 0%, RTA = 52.35 ms [15:01:22] RECOVERY - Host mw2264.mgmt is UP: PING OK - Packet loss = 0%, RTA = 54.46 ms [15:01:22] RECOVERY - Host mw2263.mgmt is UP: PING OK - Packet loss = 0%, RTA = 52.39 ms [15:01:23] RECOVERY - Host mw2266.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.71 ms [15:01:23] RECOVERY - Host mw2265.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.53 ms [15:01:24] RECOVERY - Host mw2268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.33 ms [15:01:24] RECOVERY - Host mw2269.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.75 ms [15:01:25] RECOVERY - Host mw2267.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.81 ms [15:02:25] RECOVERY - Host ores2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:03:07] RECOVERY - Host es2021.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 45.24 ms [15:03:11] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7340 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:04:00] !log power off mc2023 [15:04:01] RECOVERY - Host conf2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [15:05:41] RECOVERY - Host conf2004 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [15:06:27] RECOVERY - Host ores2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [15:07:01] RECOVERY - Host thumbor2003 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:07:07] RECOVERY - Host mw2311.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.53 ms [15:07:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:07:27] RECOVERY - Host restbase2021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [15:07:37] RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:08:01] RECOVERY - Host db2123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:08:39] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) >>! In T138093#8127054, @Krinkle wrote: > Quick drive-by note here, feel free to ignore if a false alarm. MediaWiki has a concept o... 
[15:09:27] RECOVERY - Host db2108.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:10:01] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet [15:10:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet [15:10:07] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7402 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:10:12] jouncebot: nowandnext [15:10:13] No deployments scheduled for the next 2 hour(s) and 49 minute(s) [15:10:13] In 2 hour(s) and 49 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800) [15:10:13] In 2 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800) [15:10:13] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:10:21] (03CR) 10Urbanecm: [C: 03+2] ServiceImageRecommendationProvider: Add extra logging when no JSON response received [extensions/GrowthExperiments] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819075 (https://phabricator.wikimedia.org/T313973) (owner: 10Urbanecm) [15:10:27] win 44 [15:10:34] lose 666 [15:10:35] RECOVERY - Host mw2315.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [15:10:35] RECOVERY - Host mw2314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:10:35] RECOVERY - Host mw2317.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [15:10:35] RECOVERY - Host mw2318.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms [15:10:35] RECOVERY - Host mw2316.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.65 ms [15:10:39] RECOVERY - Host mw2310.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 36.36 ms [15:10:39] RECOVERY - Host mw2312.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [15:10:41] RECOVERY - Host mw2313.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [15:11:06] (03CR) 10Thcipriani: [C: 03+1] gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:11:29] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:11:40] (03CR) 10Ahmon Dancy: [C: 03+1] "I have no objection to this approach. I'm okay with merging it today for expediency. We can change it later if needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:12:53] RECOVERY - Host mw2324.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.14 ms [15:13:23] (03CR) 10Thcipriani: [C: 03+1] "looks right from the scap side anyway 😐" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [15:14:15] RECOVERY - Host mw2322.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [15:14:15] RECOVERY - Host mw2323.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [15:15:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32240 and previous config saved to /var/cache/conftool/dbconfig/20220803-151502-marostegui.json [15:16:25] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:17:03] RECOVERY - Host mw2321.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [15:17:03] RECOVERY - Host mw2319.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:17:03] RECOVERY - Host mw2320.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [15:17:14] (03PS1) 10JMeybohm: Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 [15:17:27] (03CR) 10CI reject: [V: 04-1] Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 (owner: 10JMeybohm) [15:18:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2002 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [15:19:01] !log bking@cumin1001 
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070 [15:19:04] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [15:19:15] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070 [15:19:34] (03PS2) 10JMeybohm: Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 [15:20:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [15:21:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet [15:21:55] (03CR) 10JMeybohm: [C: 03+2] Revert "Switch read traffic for etcd to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/819802 (owner: 10JMeybohm) [15:22:23] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10bking) [15:23:21] RECOVERY - Host mw2270.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:24:37] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors [15:24:40] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors [15:24:41] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors [15:24:44] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors [15:24:45] !log jayme@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors [15:24:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors [15:25:22] TIL sre.dns.wipe-cache :) [15:26:27] RECOVERY - MegaRAID on ms-be2067 is OK: OK: optimal, 23 logical, 23 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:26:43] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:27:29] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:27:41] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32241 and previous config saved to /var/cache/conftool/dbconfig/20220803-153009-marostegui.json [15:30:33] PROBLEM - Host wcqs2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:54] !log clearing ats-be cache on cp6016 - T309651 [15:30:55] (03CR) 10Cwhite: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:56] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [15:31:21] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:31:51] (03Merged) 10jenkins-bot: ServiceImageRecommendationProvider: Add extra logging when no JSON response received [extensions/GrowthExperiments] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819075 (https://phabricator.wikimedia.org/T313973) (owner: 10Urbanecm) [15:32:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2024.codfw.wmnet [15:32:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance [15:32:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance [15:32:45] PROBLEM - Host mc2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:33:03] PROBLEM - Host wcqs2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:33:05] (03CR) 10BCornwall: [C: 03+2] geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall) [15:33:39] (03PS4) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [15:34:11] PROBLEM - Host aqs2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:12] PROBLEM - Host aqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:12] PROBLEM - Host aqs2008.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [15:34:13] PROBLEM - Host aqs2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [15:34:53] PROBLEM - Host db2161.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:53] PROBLEM - Host db2162.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:07] PROBLEM - Host db2096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance [15:35:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance [15:35:33] PROBLEM - Host rdb2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:07] (03CR) 10Jaime Nuche: [C: 03+1] "Looks good, can you point to the puppet config you're using to config those scap hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:36:17] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:36:22] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: 4438957e78e0012aff646e52dc16a4fb796cfd6b: ServiceImageRecommendationProvider: Add extra logging when no JSON response received (T313973) (duration: 03m 04s) [15:36:25] T313973: Exception: Invalid JSON response for page: Espejo - https://phabricator.wikimedia.org/T313973 [15:36:32] * urbanecm is done [15:36:53] !log powercycle kafka-logging2003 - not responsive to serial console [15:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:58] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:37:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:37:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:38:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:38:15] PROBLEM - Host mw2331.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:41] PROBLEM - Host restbase2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:38:45] PROBLEM - Host kubernetes2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:45] PROBLEM - Host kubernetes2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:50] !log clearing ats-be cache on cp6008 - T309651 [15:38:51] PROBLEM - Host kubernetes2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:53] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [15:38:55] PROBLEM - Host maps2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:08] (03CR) 10Cwhite: [C: 03+2] scap: add option to selectivlely disable bootstrapping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820139 (https://phabricator.wikimedia.org/T303559) (owner: 10Cwhite) [15:39:13] PROBLEM - Host ml-serve2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host mw2326.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host mw2327.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2328.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2332.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2325.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[15:39:52] PROBLEM - Host mw2329.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:52] PROBLEM - Host mw2330.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:53] PROBLEM - Host mw2333.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:39:53] PROBLEM - Host mw2334.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:20] (03PS1) 10Jbond: Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) [15:40:33] PROBLEM - Host db2124.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:41] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [15:40:47] (03CR) 10Jbond: Add helpful note about confd (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820145 (https://phabricator.wikimedia.org/T314489) (owner: 10CDanis) [15:40:53] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:01] (03CR) 10Jbond: [C: 03+2] Add helpful note about confd [dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) (owner: 10Jbond) [15:41:03] PROBLEM - Host db2134.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:27] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/820155 (https://phabricator.wikimedia.org/T314489) (owner: 10Jbond) [15:41:33] PROBLEM - Host db2098.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:39] PROBLEM - Host db2111.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:39] PROBLEM - Host db2110.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:45] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [15:41:47] PROBLEM - Host dbproxy2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:42:47] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:42:57] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [15:44:05] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:59] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:45:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312972)', diff saved to https://phabricator.wikimedia.org/P32242 and previous config saved to /var/cache/conftool/dbconfig/20220803-154515-marostegui.json [15:45:21] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:45:43] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:47] PROBLEM - tileratorui on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [15:45:57] 10SRE, 10SRE-Access-Requests: Requesting access to WMCS for Slavina Stefanova - https://phabricator.wikimedia.org/T313934 (10Slst2020) 05In progress→03Resolved [15:46:49] PROBLEM - tilerator on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator 
[15:48:51] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [15:51:23] (03PS5) 10BCornwall: geodns: Map out African countries by DC latency [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) [15:51:29] RECOVERY - tilerator on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [15:52:09] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:52:45] !log pooling mw2259-2270 again [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:27] RECOVERY - Host db2124.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.94 ms [15:53:53] RECOVERY - Host db2134.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.32 ms [15:54:21] RECOVERY - Host db2098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [15:54:29] RECOVERY - Host db2111.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms [15:54:29] RECOVERY - Host db2110.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [15:54:37] RECOVERY - Host dbproxy2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:55:39] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10hnowlan) [15:57:43] RECOVERY - Host mw2333.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [15:58:09] RECOVERY - Host kubernetes2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 188.24 ms [15:58:09] RECOVERY - Host kubernetes2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 187.74 ms [15:58:16] !log mvernon@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work [15:58:17] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:21] RECOVERY - Host maps2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms [15:58:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work [15:58:37] RECOVERY - Host ml-serve2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.93 ms [15:58:42] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=353f1e46-07cd-47d6-9a06-44c3a93b5b51) set by mvernon@cumin1001 for 1 day, 0:00:00 on 3 host(s) and t... [15:59:09] !log shutdown ms-be20[33,47],thanos-be2002 prior to PDU work T310070 [15:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [15:59:15] RECOVERY - Host mw2326.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [15:59:15] RECOVERY - Host mw2327.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [15:59:15] RECOVERY - Host mw2328.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms [15:59:15] RECOVERY - Host mw2332.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [15:59:15] RECOVERY - Host mw2331.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [15:59:16] RECOVERY - Host mw2325.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:59:16] RECOVERY - Host mw2330.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:59:17] RECOVERY - Host mw2329.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:59:17] RECOVERY - 
Host mw2334.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [15:59:29] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:39] RECOVERY - tileratorui on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:00:09] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) @ERayfield Thank you for the update. Hope the get well soon. I think the easiest way is we close this now and then later you can simply click reopen on this existing tick... [16:00:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:51] RECOVERY - Host db2096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.30 ms [16:01:19] RECOVERY - Host rdb2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 43.95 ms [16:01:41] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) 05In progress→03Declined Don't take the "declined" too literal. The expection is that you just change it to "open" again whenever you like. 
[16:02:29] RECOVERY - Host wcqs2001 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [16:03:21] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:03:39] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:04:35] RECOVERY - Host restbase2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms [16:04:45] RECOVERY - Host kubernetes2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:04:58] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10ERayfield) Ok, thanks - that sounds good to me Ellen [16:05:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:11] RECOVERY - Host mc2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms [16:05:25] RECOVERY - Host wcqs2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [16:05:37] RECOVERY - Host aqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [16:06:27] RECOVERY - Host aqs2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:06:27] RECOVERY - Host aqs2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:06:28] RECOVERY - Host aqs2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.34 ms [16:07:05] RECOVERY - Host db2161.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [16:07:07] RECOVERY - Host db2162.mgmt is UP: PING OK - Packet loss = 
0%, RTA = 33.69 ms [16:07:41] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:08:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs[2005-2008].codfw.wmnet [16:08:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs[2005-2008].codfw.wmnet [16:08:08] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 15 hosts [16:08:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts [16:10:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:07] PROBLEM - Host ms-be2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:11:27] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for 12 hosts [16:11:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts [16:14:17] PROBLEM - IPMI Sensor Status on mw2322 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:15:11] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:15:59] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:16:25] PROBLEM - Host furud.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:11] PROBLEM - Host thanos-be2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:19] RECOVERY - 
SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:43] PROBLEM - Host elastic2080.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:55] PROBLEM - Host mc2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:17] PROBLEM - Host elastic2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:17] PROBLEM - Host elastic2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:21] PROBLEM - Host elastic2079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:24:40] (03PS1) 10Milimetric: role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820160 [16:25:12] (03CR) 10Milimetric: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend" [puppet] - 10https://gerrit.wikimedia.org/r/820160 (owner: 10Milimetric) [16:25:46] (03CR) 10Btullis: [C: 03+2] role::common::aqs: update mw history [puppet] - 10https://gerrit.wikimedia.org/r/820160 (owner: 10Milimetric) [16:26:55] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:27:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10RobH) a:05RobH→03Jclark-ctr [16:27:41] RECOVERY - Host elastic2044.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 40.55 ms [16:27:58] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes[2009-2010,2020].codfw.wmnet [16:27:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2009-2010,2020].codfw.wmnet [16:28:35] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[16:28:53] PROBLEM - IPMI Sensor Status on wcqs2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:30:05] RECOVERY - Host thanos-be2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [16:30:21] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.58 ms [16:30:33] RECOVERY - Host ms-be2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms [16:30:34] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for rdb2008.codfw.wmnet [16:30:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for rdb2008.codfw.wmnet [16:32:01] !log power off mc2025-2026 [16:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:15] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7285 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:33:49] RECOVERY - Host mc2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [16:34:11] RECOVERY - Host elastic2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:34:13] RECOVERY - Host elastic2079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [16:34:13] RECOVERY - Host elastic2080.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:34:33] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:34:39] RECOVERY - Host furud.mgmt is UP: PING 
OK - Packet loss = 0%, RTA = 45.05 ms [16:34:39] PROBLEM - Host mc2025 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:39] PROBLEM - Host mc2026 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:14] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap [16:35:17] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [16:35:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap [16:36:53] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Dzahn) "flapping mgmt" in Icinga has been reported as succesfully fixed on other hosts through either "reset DRAC" and/or "upgrade DRAC / firmware". also see: T304289 (maybe this sh... [16:37:42] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap [16:37:49] 10SRE, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Dzahn) [16:37:51] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10Dzahn) [16:37:56] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap [16:38:34] !log jelto@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2023.codfw.wmnet [16:38:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2023.codfw.wmnet [16:39:20] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for 10 hosts [16:39:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 10 hosts [16:40:12] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc2046.codfw.wmnet [16:40:12] !log jayme@cumin1001 END 
(PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2046.codfw.wmnet [16:42:15] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1044 site=eqiad tunnel=mc2026_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:42:27] PROBLEM - Host db2164.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:46:01] PROBLEM - Host mc2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:46:01] PROBLEM - Host mc2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:46:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet [16:46:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet [16:46:57] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:47:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping [16:47:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work [16:47:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping [16:47:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work [16:47:49] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: 
instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:47:50] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0444b6dc-d394-43ed-8847-01dae0f308ee) set by mvernon@cumin1001 for 1 day, 0:00:00 on 8 host(s) and their services with rea... [16:48:09] PROBLEM - Host elastic2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:09] PROBLEM - Host gitlab-runner2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:11] PROBLEM - Host es2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:11] PROBLEM - Host es2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:48:20] !log shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work T310145 [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:23] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 [16:49:03] PROBLEM - Host kubestage2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:50:51] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:15] PROBLEM - Host db2148.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:19] PROBLEM - Host db2163.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:27] PROBLEM - Host parse2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:29] PROBLEM - Host parse2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:31] PROBLEM - Host ganeti2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:51:31] PROBLEM - Host ganeti2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:01] PROBLEM - Host es2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:09] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% 
[16:53:05] PROBLEM - Host elastic2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:01] PROBLEM - Host parse2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:45] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:54:51] PROBLEM - Host restbase2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:03] PROBLEM - Host wdqs2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:34] (03PS1) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713)
[16:58:46] (03PS1) 10Jgreen: Remove frauth1001 from Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/820164 (https://phabricator.wikimedia.org/T299068)
[16:59:56] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36611/" [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[16:59:58] (03PS1) 10Ladsgroup: auto_schema: Change replica_set to be all replicas in all dcs [software] - 10https://gerrit.wikimedia.org/r/820165 (https://phabricator.wikimedia.org/T314486)
[17:00:15] RECOVERY - IPMI Sensor Status on wcqs2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:00:29] (03PS6) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380
[17:00:37] !log btullis@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[17:01:21] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7398 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:03:13] (03PS1) 10Btullis: Revert the AQS mediawiki history change [puppet] - 10https://gerrit.wikimedia.org/r/820167
[17:04:49] PROBLEM - Host elastic2031 is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:31] (03CR) 10Btullis: [C: 03+2] Revert the AQS mediawiki history change [puppet] - 10https://gerrit.wikimedia.org/r/820167 (owner: 10Btullis)
[17:06:55] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=(kubernetes2020.codfw.wmnet|kubernetes2009.codfw.wmnet|kubernetes2010.codfw.wmnet)
[17:07:55] PROBLEM - IPMI Sensor Status on ganeti2019 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:08:07] PROBLEM - Host gitlab-runner2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:25] !log T310145 `elastic2031` and `wcqs2002` powered off in preparation for C1 maintenance
[17:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:28] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[17:08:49] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:09:21] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10RKemper)
[17:10:15] PROBLEM - Host wcqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:55] RECOVERY - Host parse2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms
[17:10:55] RECOVERY - Host parse2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms
[17:10:57] RECOVERY - Host ganeti2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.99 ms
[17:10:57] RECOVERY - Host ganeti2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.26 ms
[17:11:27] RECOVERY - Host es2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.88 ms
[17:12:29] RECOVERY - Host elastic2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[17:13:29] RECOVERY - Host parse2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[17:14:15] RECOVERY - Host restbase2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[17:14:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for 6 hosts
[17:14:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
[17:14:33] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms
[17:14:47] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:15:31] RECOVERY - Host wdqs2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[17:17:13] RECOVERY - Host db2148.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[17:17:19] RECOVERY - Host db2163.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[17:18:31] RECOVERY - Host mc2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[17:18:31] RECOVERY - Host mc2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[17:18:59] PROBLEM - IPMI Sensor Status on ganeti2020 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:19:29] RECOVERY - Host mc2025 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms
[17:20:35] RECOVERY - Host gitlab-runner2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.96 ms
[17:20:35] RECOVERY - Host elastic2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[17:20:41] RECOVERY - Host es2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.62 ms
[17:20:41] RECOVERY - Host es2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 50.99 ms
[17:21:31] RECOVERY - Host db2164.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.55 ms
[17:21:33] RECOVERY - Host kubestage2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.58 ms
[17:23:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase20[12]4.codfw.wmnet
[17:23:37] RECOVERY - Host gitlab-runner2002 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms
[17:24:01] RECOVERY - Host mc2026 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[17:24:20] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10hnowlan)
[17:24:29] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[17:25:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:26:17] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:28:05] PROBLEM - Host mc2027 is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:01] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[17:30:17] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:32:57] (Traffic bill over quota) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:33:11] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:33:33] PROBLEM - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:33:36] ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314509 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat
[17:33:40] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10ops-monitoring-bot)
[17:36:39] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7074 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:37:39] PROBLEM - Host es2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:45] PROBLEM - Host kubernetes2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:58] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:38:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[2008-2010].codfw.wmnet
[17:38:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2008-2010].codfw.wmnet
[17:39:15] RECOVERY - IPMI Sensor Status on ganeti2019 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:39:19] PROBLEM - Host es2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:29] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:39:37] PROBLEM - Host db2138.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:39] PROBLEM - Host kubernetes2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:40:41] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7104 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[17:41:35] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:42:49] PROBLEM - Host db2112.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:06] (03CR) 10Andrew Bogott: [C: 03+2] extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook)
[17:43:23] PROBLEM - Host mc2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:31] PROBLEM - Host mc2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:21] PROBLEM - Host cumin2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:37] PROBLEM - Host elastic2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:54] (03PS1) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[17:46:21] PROBLEM - Host ores2005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:46:33] PROBLEM - Host cloudcontrol2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:46:41] PROBLEM - Host wcqs2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:47:49] PROBLEM - Host ganeti2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:49] PROBLEM - Host ganeti2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:55] PROBLEM - IPMI Sensor Status on ml-serve2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:48:27] (03CR) 10CI reject: [V: 04-1] phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[17:48:33] (03PS3) 10Ebernhardson: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - 10https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426)
[17:48:55] PROBLEM - Host db2149.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:17] RECOVERY - IPMI Sensor Status on ganeti2020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[17:50:32] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=kubestage2002.codfw.wmnet
[17:50:33] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:51:15] PROBLEM - Host ores2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:43] PROBLEM - Host pc2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:51] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 35.77 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[17:52:17] PROBLEM - Host restbase2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:21] PROBLEM - Host restbase2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:31] PROBLEM - Host scs-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:52:58] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:53:56] (03CR) 10Cwhite: [C: 03+1] "Best to move this forward prior to the weekly index rollover." [puppet] - 10https://gerrit.wikimedia.org/r/818516 (owner: 10Cwhite)
[17:54:29] (03PS1) 10Ebernhardson: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426)
[17:55:05] (03PS1) 10Ebernhardson: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426)
[17:55:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2055.codfw.wmnet
[17:55:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2055.codfw.wmnet
[17:55:39] !log increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426
[17:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:47] T314426: Job queue for writes to cloudelastic falling behind - https://phabricator.wikimedia.org/T314426
[17:56:55] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for elastic2043.codfw.wmnet
[17:56:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2043.codfw.wmnet
[17:57:05] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for elastic2044.codfw.wmnet
[17:57:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2044.codfw.wmnet
[17:57:29] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:57:32] (03PS2) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[17:57:35] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:57:38] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2025-2026].codfw.wmnet
[17:57:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2025-2026].codfw.wmnet
[17:57:58] (Traffic bill over quota) resolved: Alert for device cr3-knams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:58:02] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubestage2002.codfw.wmnet
[17:58:03] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage2002.codfw.wmnet
[17:58:43] RECOVERY - Host restbase2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.72 ms
[17:58:45] RECOVERY - Host es2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms
[17:58:45] RECOVERY - Host es2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.13 ms
[17:59:03] RECOVERY - Host db2138.mgmt is UP: PING OK - Packet loss = 0%, RTA = 47.58 ms
[17:59:05] RECOVERY - Host kubernetes2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.48 ms
[17:59:05] RECOVERY - Host kubernetes2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 72.39 ms
[17:59:51] RECOVERY - Host ganeti2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms
[17:59:51] RECOVERY - Host ganeti2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms
[18:00:05] dancy and brennen: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800).
[18:00:05] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T1800).
[18:00:10] o/
[18:00:25] o/
[18:00:44] Preparing to press the button
[18:00:49] RECOVERY - Host db2149.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.84 ms
[18:00:49] RECOVERY - Host wcqs2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.56 ms
[18:01:05] (03CR) 10Jgreen: [C: 03+2] Remove frauth1001 from Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/820164 (https://phabricator.wikimedia.org/T299068) (owner: 10Jgreen)
[18:01:33] RECOVERY - Host mc2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms
[18:01:39] RECOVERY - Host mc2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.94 ms
[18:02:16] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1021 mgmt flapping - https://phabricator.wikimedia.org/T314413 (10wiki_willy) a:03Cmjohnson
[18:02:33] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 108.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[18:03:13] (03CR) 10Dduvall: "Note I added a new script instead of modifying phab_deploy_finalize so that this action could occur independent of the latter and directly" [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[18:03:19] RECOVERY - Host pc2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[18:03:29] RECOVERY - Host cumin2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.12 ms
[18:03:45] RECOVERY - Host elastic2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms
[18:04:05] RECOVERY - Host scs-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms
[18:04:31] RECOVERY - Host elastic2031 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[18:04:59] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:05:09] RECOVERY - Host cloudcontrol2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms
[18:05:11] RECOVERY - Host wcqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[18:05:19] RECOVERY - Host mc2027 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[18:06:17] RECOVERY - Host db2112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.59 ms
[18:06:29] PROBLEM - IPMI Sensor Status on ganeti2010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:07:05] brennen, dancy: note that power maintenance in codfw continues, you should be fine to deploy but don't be surprised when icinga is noisy, and you have to squint a little closer to watch for alerts you do care about
[18:07:15] thx!
[18:07:25] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076)
[18:07:26] thanks rzl
[18:07:27] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:27] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot)
[18:07:29] The button has been pressed
[18:07:43] (in particular there are no app servers left today so scapping should be unaffected)
[18:07:57] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms
[18:08:34] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820176 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot)
[18:09:21] RECOVERY - Host restbase2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms
[18:09:31] PROBLEM - IPMI Sensor Status on mc2027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:10:41] RECOVERY - Host ores2005 is UP: PING OK - Packet loss = 0%, RTA = 33.45 ms
[18:10:49] PROBLEM - ores on ores2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[18:11:27] PROBLEM - ores_workers_running on ores2005 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[18:12:05] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7128 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:12:11] RECOVERY - ores on ores2005 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[18:12:37] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23 refs T308076
[18:12:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:12:41] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076
[18:12:57] RECOVERY - ores_workers_running on ores2005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[18:14:07] RECOVERY - Host ores2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms
[18:14:58] (KubernetesCalicoDown) firing: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:15:25] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:15:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:16:15] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.23 refs T308076 (duration: 03m 37s)
[18:16:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:16:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:18:35] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11005 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:21:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:21:39] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:22:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:23:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:24:58] (KubernetesCalicoDown) resolved: (2) kubernetes2011.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:25:03] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7054 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:26:01] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:28:03] PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:40] 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jgreen)
[18:31:46] RECOVERY - IPMI Sensor Status on mc2027 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:33:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for mc[2027,2037].codfw.wmnet
[18:33:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2027,2037].codfw.wmnet
[18:35:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[18:36:44] RECOVERY - IPMI Sensor Status on ganeti2010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:37:16] RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:38:08] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[18:38:26] (03PS2) 10Jcrespo: Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794
[18:42:51] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[18:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2159 db2143', diff saved to https://phabricator.wikimedia.org/P32243 and previous config saved to /var/cache/conftool/dbconfig/20220803-184432-marostegui.json
[18:45:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:45:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:45:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:45:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:45:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:45:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:46:00] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:46:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32244 and previous config saved to /var/cache/conftool/dbconfig/20220803-184603-marostegui.json
[18:46:08] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32245 and previous config saved to /var/cache/conftool/dbconfig/20220803-184816-marostegui.json
[18:51:28] Ci is taking 20 minutes to run
[18:51:58] normally it takes 2-3
[18:52:11] (03CR) 10Jcrespo: [C: 03+2] Revert "Setup temporary arrangement for codfw snapshots durin PDU maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/819794 (owner: 10Jcrespo)
[18:55:08] (03PS1) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T314486)
[18:55:14] (03PS1) 10Cwhite: logstash: route k8s messages to k8s partition [puppet] - 10https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381)
[18:56:13] !log rzl@deploy1002 conftool action : set/pooled=yes; selector: name=kubernetes2011.codfw.wmnet
[18:56:29] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2011.codfw.wmnet
[18:56:29] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2011.codfw.wmnet
[18:56:39] (03PS2) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445)
[18:58:10] (03CR) 10Marostegui: "I am thinking....how should we start treating the codfw master? We'll no longer be able to run schema changes directly on it, but how's au" [software] - 10https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup)
[19:01:38] (03PS17) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:01:40] (03PS2) 10Jbond: peeringdb: refactor suggestions [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:02:27] (03PS3) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:03:15] (03PS1) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950)
[19:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32246 and previous config saved to /var/cache/conftool/dbconfig/20220803-190321-marostegui.json
[19:04:25] (03CR) 10Dzahn: "I know I recommended rsync::quickdatacopy myself but this was assuming it was a new thing. Since there is already the existing code here t" [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth)
[19:04:56] (03PS2) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950)
[19:05:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew)
[19:08:07] (03CR) 10Dzahn: [C: 03+1] "thanks for this!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[19:10:12] (03PS1) 10Bartosz Dziewoński: Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192
[19:11:37] (03PS2) 10Brennen Bearnes: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953)
[19:12:05] (03CR) 10CI reject: [V: 04-1] auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445) (owner: 10Ladsgroup)
[19:13:10] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7126 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:14:22] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7235 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:16:07] (03PS4) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103
[19:17:42] PROBLEM - Memcached on mc2038 is CRITICAL: connect to address 10.192.0.191 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[19:17:50] (03PS18) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:18:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32247 and previous config saved to /var/cache/conftool/dbconfig/20220803-191828-marostegui.json
[19:18:58] (03CR) 10Jbond: PeeringDB API: initial commit (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi)
[19:19:39] (03Abandoned)
10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/820103 (owner: 10Jbond) [19:21:06] (03CR) 10Brennen Bearnes: [C: 03+1] "Seems right; https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/818231 could probably be combined with this." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:23:25] (03PS2) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [19:24:16] (03CR) 10CI reject: [V: 04-1] airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:25:09] !log gerrit1001 - rsyncing /var/lib/gerrit/review_site/ over to gerrit2002 815401 [19:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:35] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) 05Open→03Resolved Disk replaced [19:26:14] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [19:27:15] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) [19:33:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32248 and previous config saved to /var/cache/conftool/dbconfig/20220803-193334-marostegui.json [19:33:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:33:38] T312972: Rename index su_normalized on table spoofuser on wmf wikis - 
https://phabricator.wikimedia.org/T312972 [19:33:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [19:33:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32249 and previous config saved to /var/cache/conftool/dbconfig/20220803-193354-marostegui.json [19:35:31] (03PS3) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [19:36:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32250 and previous config saved to /var/cache/conftool/dbconfig/20220803-193607-marostegui.json [19:38:50] (03CR) 10Brennen Bearnes: "Sidebar to this patch, I just noticed:" [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:38:52] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078 [19:38:55] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [19:39:41] !log T314078 Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the `search-loader` instances: `sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id T314078` [19:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:04] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: 
/srv/cassandra/instance-data 7058 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:40:34] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:41] !log T314078 Forgot to mention, restart is at `ryankemper@cumin1001` tmux session `codfw_restarts` [19:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:34] (03CR) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:44:24] (03PS1) 10Dzahn: acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) [19:45:06] (03CR) 10Dzahn: [C: 03+2] acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:45:11] (03CR) 10Ahmon Dancy: [C: 03+1] acme_chief: add gerrit-replica-new to SNI list [puppet] - 10https://gerrit.wikimedia.org/r/820185 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:45:27] (03PS1) 10BryanDavis: striker: Remove uwsgi deployment logging config [puppet] - 10https://gerrit.wikimedia.org/r/820206 (https://phabricator.wikimedia.org/T306469) [19:45:31] (03PS1) 10BryanDavis: striker: remove from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/820207 (https://phabricator.wikimedia.org/T306469) [19:46:46] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms [19:48:27] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381) (owner: 10Cwhite) [19:49:20] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11633 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:51:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32251 and previous config saved to /var/cache/conftool/dbconfig/20220803-195113-marostegui.json [19:52:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:52:52] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7363 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:55:00] (03CR) 10Dduvall: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [19:55:42] (03PS5) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [19:55:44] (03CR) 10AOkoth: gitlab: copy ssh host keys for failover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [19:58:59] (03PS3) 10Dduvall: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [19:59:01] (03PS6) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 
(https://phabricator.wikimedia.org/T313950) [19:59:03] (03PS3) 10Dduvall: Ensure correct group ownership following config_deploy [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/820183 (https://phabricator.wikimedia.org/T313950) [19:59:22] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:04] RoanKattouw, Urbanecm, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T2000). [20:00:04] zabe, danisztls, ebernhardson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] !log rzl@deploy1002 conftool action : set/pooled=yes; selector: name=kubernetes2012.codfw.wmnet [20:00:12] !log rzl@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2012.codfw.wmnet [20:00:13] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2012.codfw.wmnet [20:00:17] here ./ [20:00:20] \o [20:00:21] hi [20:00:26] I can deploy today [20:00:26] sry for not attending yesterday :) [20:00:34] no problem, it happens sometimes :) [20:00:38] hey :) [20:00:55] (03PS2) 10Urbanecm: Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:07] (03CR) 10Urbanecm: [C: 03+2] Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:14] hi zabe! 
let's start with you :) [20:01:22] (03CR) 10Urbanecm: [C: 03+2] Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson) [20:01:24] (03CR) 10Urbanecm: [C: 03+2] Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson) [20:01:33] (03CR) 10Urbanecm: [C: 03+2] Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192 (owner: 10Bartosz Dziewoński) [20:01:46] (03PS1) 10RLazarus: Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 [20:01:48] (03PS3) 10Urbanecm: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:02:48] (03Merged) 10jenkins-bot: Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:03:30] zabe: mwdebug1001 has your patch! can you check it please? [20:03:48] (03CR) 10CI reject: [V: 04-1] QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:03:52] !log cwhite@cumin2002 START - Cookbook sre.ganeti.makevm for new host logstash2032.codfw.wmnet [20:03:54] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [20:03:59] (I recall last time it had some issues, but I don't recall which ones. it'd be great to test it doesn't happen now) [20:04:08] danisztls: your patch seems to fail CI. Can you please fix it? 
[20:04:19] urbanecm: yes
[20:04:23] also, let me add you to the CI allowlist, so it runs automatically for you
[20:04:34] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11543 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:04:59] (03CR) 10RLazarus: [C: 03+2] Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 (owner: 10RLazarus)
[20:05:15] urbanecm, did https://test.wikipedia.org/w/index.php?title=Test&type=revision&diff=519601&oldid=519540 Could you check whether cuc_actor got the correct value?
[20:05:51] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36612/" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:05:54] (03CR) 10Dzahn: [C: 03+2] gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[20:06:01] zabe: on it
[20:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32252 and previous config saved to /var/cache/conftool/dbconfig/20220803-200619-marostegui.json
[20:06:38] urbanecm: thanks that will be super useful
[20:07:17] !log gerrit - adding second replica T313250
[20:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:21] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250
[20:07:52] zabe: looks like it works. can you verify? https://www.irccloud.com/pastebin/EzG2l1Fg/
[20:08:07] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:08:07] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2032.codfw.wmnet on all recursors
[20:08:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2032.codfw.wmnet on all recursors
[20:08:22] (03PS1) 10Dduvall: phabricator: Stop managing /srv/phab/repos [puppet] - 10https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953)
[20:08:32] urbanecm, yes looks good
[20:08:58] zabe: excellent, syncing
[20:08:58] danisztls: no problem. uploaded https://gerrit.wikimedia.org/r/820212 – once someone hits +2 on that, you'll be allowlisted. should be fairly quick :).
[20:09:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:09:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:09:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:10:10] (03Merged) 10jenkins-bot: Revert "Reducing codfw mobileapps during maintenance" [deployment-charts] - 10https://gerrit.wikimedia.org/r/820193 (owner: 10RLazarus)
[20:10:30] (03PS1) 10Bartosz Dziewoński: Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523)
[20:10:40] (03PS3) 10Ladsgroup: auto_schema: Add tests for replicas to pick [software] - 10https://gerrit.wikimedia.org/r/820181 (https://phabricator.wikimedia.org/T299445)
[20:10:44] (03PS4) 10DDesouza: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333)
[20:10:56] (03CR) 10Andrew Bogott: [C: 03+2] striker: Remove uwsgi deployment logging config [puppet] - 10https://gerrit.wikimedia.org/r/820206 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[20:11:16] MatmaRex: i see you're uploading some other backports, do you want me to ship them as well?
[20:11:20] i added one more patch to the window, i hope that's okay
[20:11:23] yes
[20:11:27] okay, let me +2 it
[20:11:31] jouncebot: now
[20:11:31] For the next 0 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220803T2000)
[20:11:33] funnily enough, the bug affects the Deployments page
[20:11:38] (03CR) 10Urbanecm: [C: 03+2] Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523) (owner: 10Bartosz Dziewoński)
[20:11:40] and i found it when adding my previous backports there
[20:11:50] heh
[20:12:18] urbanecm: should be good now
[20:12:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:12:30] danisztls: thanks, checking
[20:12:37] (03CR) 10Urbanecm: "rechecking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:12:40] well, letting CI to
[20:12:45] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:12:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 195f8090b9694be65c937cea108ff4f6400972ec: Start writing to cuc_actor on test wikis (T233004) (duration: 03m 27s)
[20:12:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:12:59] zabe: and your patch is live. please monitor the logs for a while :)
[20:13:01] (03CR) 10Andrew Bogott: [C: 03+2] "moderately confused that this file is called 'kubernetes.yaml' but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/820207 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis)
[20:13:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:13:14] sure, thanks :)
[20:13:21] no problem
[20:13:33] thanks for working on the migration!
[20:13:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:13:47] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[20:14:08] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[20:14:30] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: some yaml autoformatting [puppet] - 10https://gerrit.wikimedia.org/r/816005 (owner: 10David Caro)
[20:15:22] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[20:15:53] (03PS5) 10Urbanecm: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:15:57] (03CR) 10Urbanecm: [C: 03+2] QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:16:15] (03PS1) 10Andrea Denisse: netmon: Open firewall port to connecto to the LibreNMS database. [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074)
[20:16:18] danisztls: CI approved it (and it will run automatically on your next commits), so let's deploy and see :)
[20:16:27] (will be available at beta within half an hour)
[20:16:35] urbanecm: thanks!
[20:17:19] np
[20:17:23] * urbanecm waits on CI now for the backports
[20:17:33] (03Merged) 10jenkins-bot: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza)
[20:17:55] (03PS1) 10Ladsgroup: auto_schema: Treat master of all dcs like a master of active dc [software] - 10https://gerrit.wikimedia.org/r/820216 (https://phabricator.wikimedia.org/T314486)
[20:18:07] (03PS3) 10Dduvall: phabricator: Provide script for ensuring correct config file ownership [puppet] - 10https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950)
[20:18:09] (03PS4) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950)
[20:18:11] (03PS4) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche)
[20:19:59] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[20:20:45] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: 10Filippo Giunchedi)
[20:20:53] (03Merged) 10jenkins-bot: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/820190 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson)
[20:20:56] (03Merged) 10jenkins-bot: Add explicit partitioning key to ElasticaWrite [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820191 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson)
[20:20:58] (03Merged) 10jenkins-bot: Fix ReplyLinksController#teardown [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820192 (owner: 10Bartosz Dziewoński)
[20:21:04] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:21:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32253 and previous config saved to /var/cache/conftool/dbconfig/20220803-202125-marostegui.json
[20:21:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[20:21:29] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[20:21:35] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:21:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[20:21:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32254 and previous config saved to /var/cache/conftool/dbconfig/20220803-202146-marostegui.json
[20:22:05] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:22:32] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking)
[20:22:35] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking)
[20:22:51] ebernhardson: your patches are at mwdebug1001 if they can be tested
[20:23:05] MatmaRex: your DiscussionTools patch is at mwdebug1001 as well, can you test it please?
[20:23:34] urbanecm: hmm, nothing to really test. it all happens inside job queue stuff
[20:23:38] looking
[20:23:43] ebernhardson: ack, i'll sync it then
[20:23:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:26] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:26:27] urbanecm: looks good
[20:26:35] MatmaRex: thanks, i'll sync it
[20:26:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32255 and previous config saved to /var/cache/conftool/dbconfig/20220803-202658-marostegui.json
[20:27:02] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[20:27:28] (03PS5) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950)
[20:27:36] (03PS5) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche)
[20:27:39] (03Merged) 10jenkins-bot: Update call to PageConfigFactory::create to use new signature [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820194 (https://phabricator.wikimedia.org/T314523) (owner: 10Bartosz Dziewoński)
[20:27:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:07] !log cwhite@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host logstash2032.codfw.wmnet
[20:28:15] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: 9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 23s)
[20:28:18] T314426: Job queue for writes to cloudelastic falling behind - https://phabricator.wikimedia.org/T314426
[20:28:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:29:28] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:30:32] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[20:31:28] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: 70a18f5846111a0dfe8ba473daf384cbb8e88804: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 13s)
[20:31:37] ebernhardson: your patch is live
[20:32:24] MatmaRex: your second patch is at mwdebug1001 now, please have a look (note wikitech doesn't understand X-Wikimedia-debug, so it has to be checked elsewhere)
[20:32:32] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[20:32:40] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-Z on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:04] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-Y on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:15] (03CR) 10Brennen Bearnes: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes)
[20:33:20] urbanecm: oh hmm, i'm not sure if it's reproducible anywhere else now
[20:33:28] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-Z on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:40] urbanecm: i'd need a private wiki on wmf.23
[20:33:44] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-X on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:33:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:34:40] PROBLEM - ps1-b7-codfw-infeed-load-tower-A-phase-Y on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:42] urbanecm: thanks
[20:34:50] PROBLEM - ps1-b7-codfw-infeed-load-tower-B-phase-X on ps1-b7-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:34:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:35:02] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Open firewall port to connecto to the LibreNMS database. [puppet] - 10https://gerrit.wikimedia.org/r/820215 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[20:35:24] oh, officewiki is on wmf.23
[20:35:35] (03CR) 10Brennen Bearnes: scap: stub out a checks.yaml (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes)
[20:35:42] (03PS1) 10Cwhite: install_server: add logstash2032 provisioning data [puppet] - 10https://gerrit.wikimedia.org/r/820218 (https://phabricator.wikimedia.org/T313408)
[20:35:44] (03PS1) 10Cwhite: hiera: add logstash2032 to cluster and set access [puppet] - 10https://gerrit.wikimedia.org/r/820219 (https://phabricator.wikimedia.org/T313408)
[20:35:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:36:02] but… i can't reproduce the original issue on officewiki
[20:36:25] urbanecm: so i think that due to some configuration i don't quite understand, wikitech might be the only affected wiki?
[20:36:40] wikitech, and my localhost :)
[20:36:40] MatmaRex: ah, okay. i'll just sync it to production then
[20:36:48] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/: b840eef86837aed3e566885110e93b2ca9ab5f42: Fix ReplyLinksController#teardown (duration: 03m 27s)
[20:37:02] (03PS1) 10Dduvall: devtools: Allow for scap deployment of scap [puppet] - 10https://gerrit.wikimedia.org/r/820220 (https://phabricator.wikimedia.org/T314195)
[20:37:19] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: Stop managing /srv/phab/repos [puppet] - 10https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall)
[20:38:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:39:51] !log urbanecm@deploy1002 sync-file aborted: a804fe18f1e14795ba7836d3ebf6c361bb1538a7: Update call to PageConfigFactory::create to use new signature (T314523) (duration: 00m 00s)
[21:01:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:02:10] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:24] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:02:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:04:40] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:08:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312972)', diff saved to https://phabricator.wikimedia.org/P32258 and previous config saved to /var/cache/conftool/dbconfig/20220803-211216-marostegui.json [21:12:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:12:20] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [21:12:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32259 and previous config saved to /var/cache/conftool/dbconfig/20220803-211237-marostegui.json [21:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32260 and previous config saved to /var/cache/conftool/dbconfig/20220803-211449-marostegui.json [21:15:42] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:16:24] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:18:22] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7131 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:21:40] (CR) Ryan Kemper: [C: +2] Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson) [21:25:38] (Merged) jenkins-bot: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson) [21:26:00] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:27:28] (CR) Cwhite: [C: +2] hiera: add logstash2032 to cluster and set access [puppet] - https://gerrit.wikimedia.org/r/820219 (https://phabricator.wikimedia.org/T313408) (owner: Cwhite) [21:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32261 and previous config saved to /var/cache/conftool/dbconfig/20220803-212955-marostegui.json [21:30:04] !log ryankemper@deploy1002 helmfile [staging] START 
helmfile.d/services/changeprop: apply [21:30:06] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7260 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:30:42] !log ryankemper@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [21:32:26] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 11515 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [21:35:51] !log ryankemper@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [21:35:54] !log ryankemper@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [21:36:19] ^ interesting, it doesn't seem to think there's changes in `eqiad` [21:37:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078 [21:37:43] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [21:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32262 and previous config saved to /var/cache/conftool/dbconfig/20220803-214501-marostegui.json [21:46:48] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:53:46] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:57:55] 
(LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32263 and previous config saved to /var/cache/conftool/dbconfig/20220803-220007-marostegui.json [22:00:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:00:11] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [22:00:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:00:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32264 and previous config saved to /var/cache/conftool/dbconfig/20220803-220057-marostegui.json [22:02:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32265 and previous config saved to /var/cache/conftool/dbconfig/20220803-220309-marostegui.json [22:08:14] SRE, Observability-Logging, 
vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (colewhite) Open→Resolved Host provisioned. [22:11:03] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:14:48] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) >>! In T279664#8123041, @MatthewVernon wrote: > Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs... [22:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32266 and previous config saved to /var/cache/conftool/dbconfig/20220803-221815-marostegui.json [22:22:24] SRE, Observability-Logging, vm-requests, SRE Observability (FY2022/2023-Q1): logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (lmata) [22:25:27] (PS1) BryanDavis: service::docker: Add SyslogIdentifier to systemd unit [puppet] - https://gerrit.wikimedia.org/r/820237 (https://phabricator.wikimedia.org/T306469) [22:25:29] (PS1) BryanDavis: striker: route syslog output to ELK cluster via kafka [puppet] - https://gerrit.wikimedia.org/r/820238 (https://phabricator.wikimedia.org/T306469) [22:26:39] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:27:51] SRE, Icinga, Observability-Alerting, SRE Observability (FY2022/2023-Q1): PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (lmata) [22:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32267 and previous config saved to /var/cache/conftool/dbconfig/20220803-223321-marostegui.json [22:37:09] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) Ok, a bit more context about the maxage issue @Krinkle pointed out, plus possible solutions: The CDN expiry code in MediaWiki is s... [22:37:15] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:37:57] ^ Krinkle / bblack [22:38:04] T138093 I mean, not the etcd alert :) [22:38:05] T138093: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 [22:43:21] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) I suppose there's also a fourth option: special-case the parameter 'title' so that it is always sorted into the first position. This w... 
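[editor's note] The query-parameter normalization being discussed in T138093 above (sort parameters, with 'title' special-cased into first position) could be sketched roughly as below. This is an illustrative sketch only, not MediaWiki's or Varnish's actual implementation; the function name and title-first rule are assumptions drawn from the discussion.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def normalize_query(url: str) -> str:
    """Sort query parameters alphabetically, but keep 'title' first.

    Illustrative sketch of the T138093 idea; not the production code.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Sort by key; 'title' is special-cased to sort before all other keys.
    params.sort(key=lambda kv: (kv[0] != "title", kv[0]))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With this rule, two URLs that differ only in parameter order normalize to the same cache key, e.g. `?action=raw&title=Foo` and `?title=Foo&action=raw` both become `?title=Foo&action=raw`.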
[22:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312972)', diff saved to https://phabricator.wikimedia.org/P32268 and previous config saved to /var/cache/conftool/dbconfig/20220803-224827-marostegui.json [22:48:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:48:34] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [22:48:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:49:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:49:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:49:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance [22:49:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance [22:49:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:50:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:50:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json [22:50:53] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:55:49] PROBLEM 
- SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:55:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:08:14] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (BBlack) I think your "2 followed by 3" approach makes the most pragmatic sense. We could/should probably eventually revisit the idea of... [23:27:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:27] (CR) Papaul: [C: +2] Add new pdu model for pdu in rack b3 b6-b8 and c1 [puppet] - https://gerrit.wikimedia.org/r/820123 (https://phabricator.wikimedia.org/T310070) (owner: Papaul) [23:42:55] (Device rebooted) firing: Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:47:55] (Device rebooted) resolved: Device ps1-b3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:48:46] (Device rebooted) firing: Alert for device ps1-b6-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:49:14] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-Y on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-Y 83 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-Z on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-Z 315 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-Y on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-Y 44 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-X on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:49:16] RECOVERY - ps1-b7-codfw-infeed-load-tower-B-phase-Z on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-B-phase-Z 263 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json [23:50:34] SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Krinkle) >>! In T138093#8129915, @ori wrote: > […] > > **Option 2: Include both the sorted and unsorted forms in the canonical purge UR... 
[23:50:35] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [23:53:46] (Device rebooted) resolved: Device ps1-b6-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:55:00] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:49] (PS1) Krinkle: Remove redundant wgRC2UDPPrefix overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 [23:56:05] (PS2) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 [23:56:26] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:56:48] (CR) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle) [23:56:59] jouncebot: now [23:56:59] No deployments scheduled for the next 6 hour(s) and 3 minute(s) [23:57:05] that's what I thought [23:57:47] (Device rebooted) firing: Alert for device ps1-b7-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:59:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart