[00:02:54] (PS2) Brennen Bearnes: WIP: GitLab: enable container registry (experimental) [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) [00:24:27] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [00:25:11] (CR) Cwhite: [C: +1] "One item inline, but otherwise LGTM. Adding traffic folks for visibility." [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [00:30:48] (CR) Cwhite: prometheus::blackbox::check: add new blackbox exporter check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [00:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:50:09] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:21:00] (PS1) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 [01:23:23] (CR) jerkins-bot: [V: -1] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn) [01:26:53] (PS2) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 [01:28:50] (CR) jerkins-bot: [V: -1] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn) [01:31:37] RECOVERY - Check
systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:37] SRE, RESTBase-API, Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm) >>! In T307610#7918787, @Mitar wrote: >> Because our edge traffic code enforces a stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity)... [01:35:03] (CR) Dzahn: "variants: contains a bad variant name" [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn) [01:35:52] SRE, RESTBase-API, Traffic, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm) [01:38:21] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:25] (PS3) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 [01:41:34] !log gitlab2001 - starting backup-restore service that had failed on previous automatic run [01:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:51] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:41] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:00] ACKNOWLEDGEMENT - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:19] RECOVERY - SSH
on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:54:19] (CR) Dzahn: [C: +2] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn) [02:19:21] PROBLEM - Check systemd state on backup2003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:26:19] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:38:13] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:43:15] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:43:49] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:43:51] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:43:52] PROBLEM -
nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:43:55] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:09] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:11] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:12] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:12] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:13] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:17] RECOVERY - Check systemd state on backup2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:44:17] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:19] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:21] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:23] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:24] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:27] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:31] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:32] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:44:37] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:13] 
PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:21] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:22] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:23] PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:35] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:41] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:49] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:50] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:45:55] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 
processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:46:05] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:46:09] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:46:10] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:48:27] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:48:37] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:49:01] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:50:15] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:51:11] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:51:12] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:51:19] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:52:25] PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:52:31] PROBLEM - nova-compute proc maximum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:03] PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:04] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:07] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:07] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:09] PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:19] PROBLEM - nova-compute proc maximum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:29] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:33] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:34] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:37] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:39] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:39] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:43] RECOVERY - nova-compute proc minimum 
on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:47] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:47] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:53:53] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:25] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:27] PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:33] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:35] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:36] RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:43] RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:49] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:51] RECOVERY - nova-compute proc maximum on cloudvirt1032 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:54:53] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:03] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:03] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:07] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:15] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:23] RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with 
PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:24] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:27] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:27] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:28] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:29] RECOVERY - nova-compute proc maximum on cloudvirt1027 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:29] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:30] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:41] RECOVERY - nova-compute proc maximum on cloudvirt1019 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[02:55:46] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:49] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:51] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:55:51] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:56:45] RECOVERY - nova-compute proc maximum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:57:37] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:59:21] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:22:05] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:22:49] PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:24:13] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:05] RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [03:39:21] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:44:09] PROBLEM - ensure kvm processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:48:37] RECOVERY - ensure kvm processes are running on cloudvirt1021 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:02:07] PROBLEM - ensure kvm processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:07:41] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:11:50] ACKNOWLEDGEMENT - ensure kvm 
processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott needs a canary! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:19:59] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:19:59] RECOVERY - ensure kvm processes are running on cloudvirt1021 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:21:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:01:47] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:49] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2146 T301879', diff saved to https://phabricator.wikimedia.org/P27778 and previous config saved to /var/cache/conftool/dbconfig/20220511-051307-marostegui.json [05:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:13] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [05:13:59] (PS1) Marostegui: db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790796
(https://phabricator.wikimedia.org/T301879) [05:14:45] (CR) Marostegui: [C: +2] db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790796 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui) [05:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2146 T301879', diff saved to https://phabricator.wikimedia.org/P27779 and previous config saved to /var/cache/conftool/dbconfig/20220511-051703-marostegui.json [05:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:22] (PS1) Marostegui: Revert "db2146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790440 [05:25:39] (CR) Marostegui: [C: +2] Revert "db2146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790440 (owner: Marostegui) [05:27:01] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [05:34:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27780 and previous config saved to /var/cache/conftool/dbconfig/20220511-053418-marostegui.json [05:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:24] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [05:35:12] (CR) Marostegui: [C: -1] "Let's do it the other way around, let's drop it from the production databases before removing it from this file.
I am setting it to -1 unt" [puppet] - https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: Zabe) [05:49:18] SRE, conftool: Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Ladsgroup) [05:49:29] SRE, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Ladsgroup) [05:50:20] SRE-OnFire, DBA, Blocked-on-schema-change, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) I don't think we should attempt this schema change again. There's not much b... [06:02:41] SRE, SRE-OnFire, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Marostegui) [06:14:52] (PS1) Marostegui: db2146: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/790963 (https://phabricator.wikimedia.org/T308099) [06:16:16] (CR) Marostegui: [C: +2] db2146: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/790963 (https://phabricator.wikimedia.org/T308099) (owner: Marostegui) [06:31:55] !log mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --verbose --mediatype=AUDIO --mime audio/webm (T226311) [06:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:01] T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes - https://phabricator.wikimedia.org/T226311 [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile -
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:40:20] !log db2146 set global innodb_max_dirty_pages_pct = 75; T307082 [06:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:25] T307082: Investigate spikes on db1132 (mariadb 10.6 host) - https://phabricator.wikimedia.org/T307082 [06:42:52] 10SRE, 10RESTBase-API, 10Traffic, 10Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) > Most of that is controlled by the SRE team at a level in front of the REST API, since the frontend caching layer is a shared resource across everything.... [06:53:00] (03PS1) 10Slyngshede: Move Wiki Rsync fetch jobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) [06:53:35] (03CR) 10jerkins-bot: [V: 04-1] Move Wiki Rsync fetch jobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:57:29] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:05:18] !log updating ganeti4* to Ganeti 3.0.1-1~bpo10+1 T307997 [07:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:24] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 [07:14:49] (03CR) 10MVernon: [C: 03+2] swift: drain ms-be1059, skip cluster-OK checks [puppet] - 10https://gerrit.wikimedia.org/r/790694 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [07:17:15] (03PS7) 10Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [07:18:21] !log drain ganeti4001 T307997 [07:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:26] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 [07:22:29] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4004.ulsfo.wmnet with OS bullseye [07:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye [07:32:14] (03PS2) 10Slyngshede: Move Wiki Rsync fetch jobs to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) [07:32:57] 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Joe) a:03Joe [07:42:52] (03PS1) 10Ayounsi: wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) [07:44:26] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: host reimage [07:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:37] (03PS1) 10Elukey: Set celery 5 settings for ores2009 [puppet] - 10https://gerrit.wikimedia.org/r/790978 (https://phabricator.wikimedia.org/T303801) [07:46:03] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-06-10 07:44:58 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [07:46:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2054.codfw.wmnet with OS bullseye [07:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:23] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2054.codfw.wmnet with OS bullseye [07:46:30] (03CR) 10Jaime Nuche: [C: 03+1] deployment_server: Add keyholder identity for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: 10RLazarus) [07:46:53] (03CR) 10Elukey: [C: 03+2] Set celery 5 settings for ores2009 [puppet] - 10https://gerrit.wikimedia.org/r/790978 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [07:46:55] 
(NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:47:25] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: host reimage [07:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores2009.codfw.wmnet with OS buster [07:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [07:55:03] (03PS1) 10Giuseppe Lavagetto: Raise an error if wrong tags are used in a query. [software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) [07:57:35] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:59:13] 10SRE, 10Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (10MoritzMuehlenhoff) [07:59:22] 10SRE, 10Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:59:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] Double the number of replicas of eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [07:59:56] 10SRE, 10SRE-swift-storage, 10Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 
(10Yann) 05Open→03Resolved a:03Yann The issue may come from my Internet connection. I will repost after further tests. [08:00:45] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2054.codfw.wmnet with reason: host reimage [08:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:57] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4004.ulsfo.wmnet with OS bullseye [08:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:03] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye completed: - ganeti4004 (**PASS**) - Down... [08:04:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2054.codfw.wmnet with reason: host reimage [08:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:06] (03PS1) 10AikoChou: ml-services: update articlequality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/790983 (https://phabricator.wikimedia.org/T301766) [08:06:15] 10SRE, 10SRE-swift-storage, 10Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 (10Aklapper) 05Resolved→03Invalid [08:11:37] (03CR) 10Awight: "I'm not sure why a wmf.11 branch was made, probably just an automatic process. 
We can ignore because it will never be deployed: https://w" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/790437 (https://phabricator.wikimedia.org/T307972) (owner: 10Thiemo Kreuz (WMDE)) [08:12:57] !log Rename revision_actor_temp on db1132 (s1) and db1114 (s8) T307906 [08:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:01] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [08:15:35] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2009.codfw.wmnet with reason: host reimage [08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2054.codfw.wmnet with OS bullseye [08:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:12] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2054.codfw.wmnet with OS bullseye completed: - ms-be2054 (**PASS**) - Downtim... 
[08:19:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2009.codfw.wmnet with reason: host reimage [08:19:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:12] (03CR) 10Elukey: [C: 03+2] ml-services: update articlequality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/790983 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [08:20:16] (03CR) 10Vgutierrez: requestctl: add AND NOT and OR NOT to the parsing grammar (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [08:25:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (didn't test it though)" [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:27:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/790773 (owner: 10Dwisehaupt) [08:40:23] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:46:21] !log logging an example as part of Simon's omboarding [08:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:22] (03CR) 10David Caro: [C: 03+2] "LGTM, though we should probably move this to a cookbook sooner than later." 
[puppet] - 10https://gerrit.wikimedia.org/r/790735 (owner: 10Majavah) [08:49:01] (03CR) 10David Caro: [C: 03+1] rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [08:50:00] (03CR) 10Slyngshede: [C: 03+2] OpenLDAP, move restart cronjob to systemd timer. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:50:43] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host ores2009.codfw.wmnet with OS buster [08:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:13] (03CR) 10Btullis: [C: 03+2] Double the number of replicas of eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [08:58:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [08:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [09:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:27] (03Merged) 10jenkins-bot: Double the number of replicas of eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [09:04:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-cluster [09:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:22] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [09:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:06] !log btullis@deploy1002 helmfile [codfw] 
DONE helmfile.d/services/eventgate-analytics-external: apply [09:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:52] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:35] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [09:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:20] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I have now deployed the change to double the number of replica pods for eventgate-an... [09:12:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:10] jouncebot: next [09:14:10] In 3 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300) [09:15:37] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-cluster [09:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:23] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet [09:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:11] (03CR) 10Kosta Harlan: Account creation: update live campaigns config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (owner: 10Sergio Gimeno) [09:21:29] (03PS1) 10Ayounsi: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) [09:22:01] (03PS2) 10Sergio Gimeno: Account creation: 
update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [09:22:54] (03CR) 10Muehlenhoff: [C: 03+1] "Two typos inline, looks great!" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [09:23:18] (03CR) 10Slyngshede: [C: 03+2] Convert dumps-status from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:23:47] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [09:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:45] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2055.codfw.wmnet with OS bullseye [09:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2055.codfw.wmnet with OS bullseye [09:24:59] (03PS1) 10Jcrespo: BackupStatistics: Increase maximum backup time to a week [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/790992 [09:25:34] (03PS4) 10Samtar: changeprop: Remove RESTBase page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) [09:25:44] (03CR) 10Jcrespo: [C: 03+2] BackupStatistics: Increase maximum backup time to a week [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/790992 (owner: 10Jcrespo) [09:27:06] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2001.codfw.wmnet [09:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:40] (03CR) 10Btullis: [C: 03+2] "I will deploy this change today. 
Thanks for the guidance jcrespo." [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [09:27:54] !log systemctl reset-failed ifup@ens5.service on registry2003 - T273026 [09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:58] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [09:30:51] hnowlan: reckon we might be able to get https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/767878 deployed today? [09:33:03] (03CR) 10Jcrespo: [C: 03+1] dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [09:34:27] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for registry2003.codfw.wmnet [09:34:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for registry2003.codfw.wmnet [09:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:14] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2002.codfw.wmnet [09:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:19] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet [09:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet [09:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - 
https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7916593, @cmooney wrote: >> If there is any kind of anycast with the k8s prefixes (same prefix adverti... [09:41:37] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2002.codfw.wmnet [09:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:01] (03CR) 10Cathal Mooney: [C: 03+2] Add 'includes' in private address reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/790744 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [09:42:22] PROBLEM - ganeti-noded running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:42:59] jouncebot: next [09:42:59] In 3 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300) [09:43:02] PROBLEM - ganeti-mond running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [09:43:55] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [09:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:18] PROBLEM - ganeti-confd running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:44:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:22] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [09:49:01] 
(03PS1) 10Btullis: Add grants for new databases to be backed up on analytics-meta [puppet] - 10https://gerrit.wikimedia.org/r/790994 (https://phabricator.wikimedia.org/T308113) [09:50:21] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet [09:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:24] (03PS1) 10Ladsgroup: Set dewiki to read new for templatelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790997 (https://phabricator.wikimedia.org/T306673) [12:35:29] (03CR) 10Marostegui: [C: 04-1] filtered_tables: remove flaggedpage_config.fpc_select (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: 10Zabe) [12:36:29] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [12:42:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27786 and previous config saved to /var/cache/conftool/dbconfig/20220511-124226-marostegui.json [12:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:32] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [12:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:44:13] (03CR) 10Yaron Koren: [C: 03+1] "Thanks! As you'd expect, this looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [12:45:32] (03CR) 10Tacsipacsi: filtered_tables: remove flaggedpage_config.fpc_select (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: 10Zabe) [12:45:34] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:39] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye [12:47:01] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) > Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from... 
[12:50:22] (03PS3) 10Slyngshede: Move dumps exception checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673) [12:50:38] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2007.codfw.wmnet with reason: host reimage [12:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:28] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2007.codfw.wmnet with reason: host reimage [12:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I have stumbled upon this issue with HAProxy, which seems to fit some of the symptom... [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300). [13:00:04] awight: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] I can self-deploy this one. [13:00:19] ok! [13:00:31] (03CR) 10Awight: [C: 03+2] "Deploying." 
[extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790436 (https://phabricator.wikimedia.org/T307972) (owner: 10Thiemo Kreuz (WMDE)) [13:01:03] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.17, 102.28, 94.08 https://wikitech.wikimedia.org/wiki/Swift [13:03:51] (03Merged) 10jenkins-bot: Fix incomplete FlaggedRevs::binaryFlagging() implementation [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790436 (https://phabricator.wikimedia.org/T307972) (owner: 10Thiemo Kreuz (WMDE)) [13:04:32] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) [13:05:03] 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10EAkinloose) I am good to go. Thanks @RLazarus ! 
{F35130227} [13:07:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:27] (03PS4) 10Kosta Harlan: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [13:08:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:20] (03PS1) 10Jelto: gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) [13:11:28] !log awight@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:790436|Fix incomplete FlaggedRevs::binaryFlagging() implementation (T307972)]] (duration: 00m 51s) [13:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:33] T307972: Admins & Reviewers can't review revision - https://phabricator.wikimedia.org/T307972 [13:12:08] (03PS2) 10Jelto: gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) [13:13:00] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:18] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35193/console" [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:13:20] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4001.ulsfo.wmnet with OS bullseye [13:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:25] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**) - Removed from... [13:14:55] !log EU backports complete [13:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:39] PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:17:41] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:21:23] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.18, 100.83, 99.23 https://wikitech.wikimedia.org/wiki/Swift [13:21:24] (03PS1) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot a grid queue [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 [13:23:05] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) Just pasting this for the record here with one other test I have done on 
db2... [13:24:08] (03PS2) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot a grid queue [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 [13:25:37] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2007.codfw.wmnet with OS buster [13:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] (03PS1) 10Majavah: P:wmcs::prometheus: increase openstack-exporter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178) [13:28:03] (03CR) 10David Caro: [C: 03+2] P:wmcs::prometheus: increase openstack-exporter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [13:28:15] (03PS1) 10Klausman: hiera: Use celery v5 on ores2008 [puppet] - 10https://gerrit.wikimedia.org/r/791034 (https://phabricator.wikimedia.org/T303801) [13:28:23] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:28:41] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35194/console" [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [13:29:50] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores2008 [puppet] - 10https://gerrit.wikimedia.org/r/791034 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [13:30:17] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2008.codfw.wmnet with OS buster [13:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:29] (03CR) 10Majavah: "not sure if this is needed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/791033/?" 
[puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [13:33:30] (03PS20) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [13:33:42] (03CR) 10Ottomata: "Thank you Ben!" [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [13:34:23] (03CR) 10Ottomata: Enable basic monitoring of the airflow services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:34:33] (03CR) 10Ottomata: "One nit, +1 otherwise (or either way)" [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:34:35] (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [13:34:42] (03CR) 10Ottomata: [C: 03+1] Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:36:51] (03PS2) 10Btullis: Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) [13:37:35] (03CR) 10Jbond: rake: Add new rake task to convert a module to SPDX (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [13:40:28] (03CR) 10Muehlenhoff: rake: Add new rake task to convert a module to SPDX (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [13:41:09] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:41:34] (03CR) 10Btullis: Enable basic monitoring of the airflow 
services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:41:42] (03PS21) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [13:42:21] RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:42:38] (03CR) 10Ottomata: [C: 03+1] Move Wiki Rsync fetch jobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:44:59] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Vgutierrez) Beginning with HAProxy 2.1 HTX is the only way to go. On another issue (https://g... [13:45:31] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:54:31] (03PS1) 10Slyngshede: Move tilerator regeneration from crontab to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) [13:54:54] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye [13:54:58] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye [13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] (03CR) 10jerkins-bot: [V: 04-1] Move tilerator regeneration from crontab to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:55:33] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2008.codfw.wmnet with reason: host reimage [13:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] (03PS3) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [13:58:37] (03PS2) 10Slyngshede: Move tilerator regeneration from crontab to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) [13:58:57] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2008.codfw.wmnet with reason: host reimage [13:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:05] (03PS4) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [13:59:32] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35196/console" [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:05:17] (03PS1) 10Elukey: Add Aiko and Kevin to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927) [14:07:01] (03CR) 10CDanis: [C: 03+1] Raise an error if wrong tags are used in a query. [software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) (owner: 10Giuseppe Lavagetto) [14:08:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:40] (03CR) 10Klausman: [C: 03+1] "In an ideal world, we would be able to separate/isolate secrets for different deployment destinations (i.e. ML k8s vs others) from each ot" [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927) (owner: 10Elukey) [14:12:53] (03PS4) 10Slyngshede: Move dumps exception checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673) [14:15:25] (03Abandoned) 10Slyngshede: Move dumps exception checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:22:40] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4001.ulsfo.wmnet with OS bullseye [14:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:45] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**) - Removed from... [14:25:54] (03PS5) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [14:29:40] (03CR) 10Volans: [C: 03+1] "Looks sane to me, I didn't test it though." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [14:31:25] (03PS6) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [14:32:44] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2008.codfw.wmnet with OS buster [14:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:48] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (10Papaul) @Volans the only reason i see is the size of the disks and number of disks. We are using software RAID on 8x ~2TB disks [14:36:46] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:40:08] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:42:24] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:45:18] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Tsevener) Hi folks - automatic updates are going out at 100% now. 
Hopefully the load is looking okay on your end. Thanks! [14:48:47] (03CR) 10Jbond: [C: 04-1] "This is manually deployed on netbox-dev2002, setting to -1 until we have rebuild the netbox frontends" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [14:51:35] !log depool ats-be on cp4032 [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] (03PS7) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [14:58:14] !log installing qemu security updates on bullseye [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:19] !log pool ats-be on cp4032 [15:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27789 and previous config saved to /var/cache/conftool/dbconfig/20220511-150038-marostegui.json [15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:04:45] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) > Even in the legacy setup (pre row e/f) adding new nodes requires manual error-prone gerrit changes like this one 35b0... 
[15:13:23] (03CR) 10CDanis: "Overall LGTM, just one thought" [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [15:14:42] (03PS21) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:15:43] (03CR) 10CDanis: [C: 03+1] requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) (owner: 10Giuseppe Lavagetto) [15:17:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:18:14] (03CR) 10CDanis: [C: 03+1] requestctl: add retry-after request header when applicable [software/conftool] - 10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto) [15:18:24] (03CR) 10Legoktm: [C: 04-1] "Yay!" [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [15:25:08] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@378e7ca]: (no justification provided) [15:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:17] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@378e7ca]: (no justification provided) (duration: 00m 08s) [15:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:59] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:45:55] (03CR) 10Volans: "Have you considered using SREBatchBase/SREBatchRunnerBase instead? 
If they don't work for this specific use case John and I would like to " [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [15:46:28] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@378e7ca]: (no justification provided) [15:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:32] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@378e7ca]: (no justification provided) (duration: 00m 03s) [15:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:49:45] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti4001.mgmt.ulsfo.wmnet with reboot policy FORCED [15:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:03] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4001.mgmt.ulsfo.wmnet with reboot policy FORCED [15:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:25] (03CR) 10Btullis: [C: 03+2] Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [15:53:39] (03CR) 10Volans: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [15:58:08] (03PS1) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) [16:00:45] (03CR) 10Hnowlan: [C: 03+2] Update copyrights. 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/790645 (https://phabricator.wikimedia.org/T307398) (owner: 10Roman Stolar) [16:02:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [16:02:46] (03Merged) 10jenkins-bot: Update copyrights. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/790645 (https://phabricator.wikimedia.org/T307398) (owner: 10Roman Stolar) [16:09:42] 10SRE, 10conftool, 10Patch-For-Review: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (10Joe) a:03Joe [16:10:53] (03PS1) 10KartikMistry: Update cxserver to 2022-05-11-135122-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/791052 (https://phabricator.wikimedia.org/T307967) [16:12:10] (03CR) 10Hashar: planet: add between the brackets podcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [16:12:28] (03PS2) 10Hashar: planet: add between the brackets podcast [puppet] - 10https://gerrit.wikimedia.org/r/790998 [16:23:45] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10akosiaris) Thanks for the update. Load has somewhat increased on our side, albeit minimally. 
[16:28:53] PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:29:22] (03PS7) 10Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) [16:29:24] (03PS3) 10Giuseppe Lavagetto: requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) [16:29:26] (03PS2) 10Giuseppe Lavagetto: requestctl: add retry-after request header when applicable [software/conftool] - 10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824) [16:31:33] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:31:35] PROBLEM - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:34:15] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:34:49] (03CR) 10Thcipriani: [C: 03+1] "🎉" [puppet] - 
10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [16:39:07] (03CR) 10CDanis: [C: 03+1] requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [16:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:56:20] (03PS2) 10DLynch: Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) [16:56:25] (03PS1) 10Clare Ming: Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) [16:57:17] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 103.50, 101.05, 98.16 https://wikitech.wikimedia.org/wiki/Swift [17:00:48] (03PS1) 10Stang: commonswiki: Add *.toolforge.org to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) [17:10:16] (03PS1) 10Btullis: Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462) [17:12:07] (03CR) 10Dzahn: [C: 03+2] "thanks all, also for taking care it's https. 
I've been trying to replace all the http links where possible" [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [17:13:04] (03CR) 10jerkins-bot: [V: 04-1] Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming) [17:14:13] (03CR) 10Majavah: [C: 04-1] "I'd still prefer if we added invividual tools instead of all of toolforge / cloud vps (*.wmcloud.org) based on need, like we do for any ot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) (owner: 10Stang) [17:14:55] (03CR) 10Dzahn: [C: 03+2] "i'll just leave it at 1hr intervals, no need to worry about it I think." [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar) [17:15:27] (03PS4) 10Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - 10https://gerrit.wikimedia.org/r/790786 [17:16:27] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:18:49] (03CR) 10Clare Ming: "recheck" [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming) [17:18:55] (03CR) 10Brennen Bearnes: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [17:18:59] (03CR) 10Jgreen: [C: 03+2] Turn on monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/790773 (owner: 10Dwisehaupt) [17:19:03] (03CR) 10Stang: commonswiki: Add *.toolforge.org to wgCopyUploadsDomains allowlist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) (owner: 10Stang) [17:27:56] (03CR) 10Dzahn: [C: 03+1] gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [17:30:41] mutante: thanks :] I still have to listen to those "between the brackets" podcast, hopefully having them in planet will cause me to start listening to them [17:32:41] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:33:23] hashar: :) glad to hear planet can make a difference [17:36:05] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (23) node(s) change every puppet run: contint1001, contint2001, cuminunpriv1001, ms-be1040, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, puppetmaster1001, puppetmaster2001, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2 [17:36:05] nos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [17:38:39] contint* doing changes on every run would be new [17:39:17] one of the issues here is there are always some hosts doing that but it stays slightly under the alerting treshold [17:40:43] ah, it's what I already reported as https://phabricator.wikimedia.org/T307740 [17:41:47] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: 
contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Dzahn) This contributes to: ` 17:36 <+icinga-wm> PROBLEM - Ensure hosts are not performing a change on every pupp... [17:43:11] (03CR) 10Sergio Gimeno: [C: 04-1] "Missing to add the correct messageKey for social-latam-2022-A campaign" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [17:44:33] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:45:06] (03CR) 10Dzahn: [V: 03+2] create a separate variant for 15.wikipedia.org site [container/miscweb] - 10https://gerrit.wikimedia.org/r/790786 (owner: 10Dzahn) [17:46:51] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:50:52] (03PS2) 10Zabe: swift: migrate container stats cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) [17:51:04] (03CR) 10Zabe: swift: migrate container stats cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:52:52] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/35197/" [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:53:31] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "recheck" [container/miscweb] - 10https://gerrit.wikimedia.org/r/790786 (owner: 10Dzahn) [18:00:05] Deploy window Train log triage with CPT 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1800) [18:02:36] (03CR) 10Razzi: [C: 03+2] Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [18:06:38] !log razzi@lvs1020:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915 [18:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:37] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) HPE came back and asked me to reseat the Raid controller battery, this did not fix the issue, the NIC card is still flashing amber and during post I notice that the process... [18:17:04] The pybal error above happened when I stopped pybal to apply the change [18:22:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:24:57] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 107.18, 100.14, 95.79 https://wikitech.wikimedia.org/wiki/Swift [18:33:52] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:34:02] (03CR) 10Herron: [C: 03+1] Rewrite logster::job to use systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:36:15] 10SRE-swift-storage, 10Data-Engineering, 10Data-Persistence, 10Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (10Milimetric) @Htriedman: I know you're talking to @EChetty about this, we're triaging it to this column which is like a task "inc... [18:36:29] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Cmjohnson) @Dzahn The server is dead, it will not power on, I attempted to get to basic start-up, 1 DIMM, 1 CPU, and still will not power on, Historically a main board swap is requi... [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:45:09] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) Can we resurrect this and I finish out the esams hosts? I'd like to close this out, its just shaming me with its age. Checking first since the traffic te... 
[19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:04:52] (03CR) 10RLazarus: [V: 03+1 C: 03+2] deployment_server: Add keyholder identity for scap [puppet] - 10https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: 10RLazarus) [19:04:55] 10SRE-swift-storage, 10Data-Engineering, 10Data-Persistence, 10Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (10Htriedman) @Milimetric Thanks for the pointers on this process! I also just talked to @gmodena and think that we're starting to... [19:14:20] (03CR) 10Bernard Wang: [C: 03+1] "LGTM!" [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming) [19:18:38] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:19:29] !log Added new `scap` identity to keyholder on deploy[1002,2002] - T307351 [19:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:35] T307351: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 [19:19:57] 10SRE, 10ops-eqiad, 10DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson DIMM replaced and booted into the OS, I was able to update the firmware while it was offline. [19:20:42] 10SRE, 10SRE-Access-Requests, 10Scap, 10Patch-For-Review: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10RLazarus) 05Open→03Resolved a:03RLazarus This should be all set! ` rzl@deploy1002:~$ sudo run-puppet-agent [...] 
rzl@deploy1002:~$ sudo keyholder add /... [19:26:04] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T306129 (10Cmjohnson) 05Open→03Resolved This was wrongly entered, it's connected to cloudstore1011 that was in xe-4/0/23. Updated netbox and ran homer [19:30:52] Thanks rzl! [19:31:49] dancy: of course! haven't done that before, let me know if I missed anything [19:32:54] (03PS1) 10Ebernhardson: [Beta Cluster] LabsServices: Move eventgate to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862) [19:36:32] (03PS6) 10Brennen Bearnes: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) [19:42:16] (03PS7) 10Brennen Bearnes: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) [19:42:25] (03PS1) 10Dzahn: drop the staging directory and contents [container/miscweb] - 10https://gerrit.wikimedia.org/r/791071 [19:42:49] (03CR) 10jerkins-bot: [V: 04-1] drop the staging directory and contents [container/miscweb] - 10https://gerrit.wikimedia.org/r/791071 (owner: 10Dzahn) [19:43:27] (03PS2) 10Dzahn: drop the staging directory and contents [container/miscweb] - 10https://gerrit.wikimedia.org/r/791071 [19:45:44] (03PS1) 10Dzahn: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 [19:46:21] (03CR) 10Ottomata: [C: 03+2] "OOps. Thank you." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862) (owner: 10Ebernhardson) [19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:48:06] (03Merged) 10jenkins-bot: [Beta Cluster] LabsServices: Move eventgate to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862) (owner: 10Ebernhardson) [19:53:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:54:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T2000). [20:00:04] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (10HMonroy) Thank you @RLazarus !! [20:00:05] kemayo and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] i can deploy - since i'm on the list anyway [20:00:25] 👋 [20:00:53] hi Kemayo: are you a self-server or would you like me to deploy? [20:01:01] cjming: I need you to deploy, alas. [20:01:07] alrighty [20:01:24] (03CR) 10Clare Ming: [C: 03+2] Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) (owner: 10DLynch) [20:02:09] (03Merged) 10jenkins-bot: Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) (owner: 10DLynch) [20:03:12] Kemayo: is your patch something that can be tested on mwdebug1001? [20:03:31] Should be testable there, yeah. [20:03:45] lmk and I will sync on your good word [20:04:10] cjming: Okay, it looks good. [20:04:18] great - syncing [20:05:03] (03CR) 10Clare Ming: [C: 03+2] Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming) [20:05:28] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790395|Release DiscussionTools new topic tool to former a/b test wikis (T307410)]] (duration: 00m 54s) [20:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:35] T307410: [Config Change] Enable the New Topic Tool as opt-out at A/B test wikis - https://phabricator.wikimedia.org/T307410 [20:05:36] Kemayo: your change should be live [20:05:44] cjming: Thanks for the help! [20:05:49] np! 
[20:05:56] doing my patch now [20:07:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:29] (03Merged) 10jenkins-bot: Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming) [20:25:50] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/skins/Vector/resources: Backport: [[gerrit:790443|Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC (T307952 T307345)]] (duration: 00m 52s) [20:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:57] T307345: Sticky header disappears within lead sections of certain articles when old table of contents scrolls into view - https://phabricator.wikimedia.org/T307345 [20:25:57] T307952: Vector isnt firing 'scroll-to-toc' and 'scroll-to-top' events correctly - https://phabricator.wikimedia.org/T307952 [20:26:39] nothing else in the queue so I'll go ahead and close this window [20:26:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[20:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:09] !log end of UTC late backport & config window [20:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:19] !log T304542 running mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php hiwiki --verbose [20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:24] T304542: Deploy "add a link" to third round of wikis - https://phabricator.wikimedia.org/T304542 [20:28:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:19] (03PS8) 10Brennen Bearnes: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) [20:34:02] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.04, 101.18, 94.96 https://wikitech.wikimedia.org/wiki/Swift [20:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:53:42] (03CR) 10Dzahn: [C: 03+2] drop the staging directory and contents [container/miscweb] - 10https://gerrit.wikimedia.org/r/791071 (owner: 10Dzahn) [20:57:46] (03Merged) 10jenkins-bot: drop the staging directory and contents 
[container/miscweb] - 10https://gerrit.wikimedia.org/r/791071 (owner: 10Dzahn) [21:01:04] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) Synced with Brandon via IRC, and I'm good to resume this. Each host, one per cluster at a time (one upload, one text), disabling puppet agent, depooling,... [21:01:24] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) 05Open→03In progress [21:01:27] 10SRE, 10Traffic-Icebox: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10RobH) [21:06:34] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 76.39, 72.59, 79.52 https://wikitech.wikimedia.org/wiki/Swift [21:28:27] (03CR) 10Brennen Bearnes: "This has been tested on gitlab-prod-1001.devtools.wmcloud.org and, once a Security Group is created to allow ingress to port 5050, seems t" [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [21:29:08] (03CR) 10Ahmon Dancy: [V: 03+1 C: 03+1] "Tested" [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [21:33:23] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [21:39:20] PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:45] my bad on that put into maint and fired updates at same time [21:41:51] should have done maint and waited 30 seconds [21:42:25] (03PS1) 10Gergő Tisza: Temporarily disable link recommendation backend on hi, uk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791085 (https://phabricator.wikimedia.org/T308186) [21:47:52] PROBLEM - Check for VMs leaked by the nova-fullstack test on 
cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [21:48:41] (03PS1) 10Gergő Tisza: Revert "Temporarily disable link recommendation backend on hi, uk" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791089 (https://phabricator.wikimedia.org/T308186) [21:49:52] RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 81.04 ms [21:50:24] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott I will investigate this if I get a moment before bedtime https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:01:24] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [22:19:30] (03PS1) 10Dzahn: define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - 10https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171) [22:28:32] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [22:33:14] PROBLEM - Host cp3058 is DOWN: PING CRITICAL - Packet loss = 100% [22:33:58] PROBLEM - Host cp3059 is DOWN: PING CRITICAL - Packet loss = 100% [22:36:43] i put them in to maint... whyyy echoing down.... 
[22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:43:36] RECOVERY - Host cp3059 is UP: PING OK - Packet loss = 0%, RTA = 81.05 ms [22:44:00] RECOVERY - Host cp3058 is UP: PING OK - Packet loss = 0%, RTA = 81.02 ms [22:48:12] PROBLEM - purged service on cp3058 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:50:00] PROBLEM - purged service on cp3059 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:50:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:52:30] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_purged.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:54:58] (03PS1) 10Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) [22:56:56] (03CR) 10jerkins-bot: [V: 04-1] move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [23:00:41] (03PS2) 10Dzahn: 
move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:03:27] (03CR) 10jerkins-bot: [V: 04-1] move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [23:06:10] RECOVERY - purged service on cp3059 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:06:38] RECOVERY - purged service on cp3058 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:07:07] (03PS3) 10Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) [23:15:35] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [23:23:47] (03CR) 10Dzahn: [C: 03+2] rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (owner: 10Dzahn) [23:23:51] (03PS2) 10Dzahn: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 [23:32:24] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:32:50] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:34:14] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:37:06] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:37:30] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:38:52] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale