[00:02:54] <wikibugs>	 (PS2) Brennen Bearnes: WIP: GitLab: enable container registry (experimental) [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)
[00:24:27] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[00:25:11] <wikibugs>	 (CR) Cwhite: [C: +1] "One item inline, but otherwise LGTM.  Adding traffic folks for visibility." [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[00:30:48] <wikibugs>	 (CR) Cwhite: prometheus::blackbox::check: add new blackbox exporter check (2 comments) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[00:43:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:50:09] <icinga-wm>	 PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:21:00] <wikibugs>	 (PS1) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786
[01:23:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[01:26:53] <wikibugs>	 (PS2) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786
[01:28:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[01:31:37] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:37] <wikibugs>	 SRE, RESTBase-API, Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm) >>! In T307610#7918787, @Mitar wrote: >> Because our edge traffic code enforces a stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity)...
[01:35:03] <wikibugs>	 (CR) Dzahn: "variants: contains a bad variant name" [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[01:35:52] <wikibugs>	 SRE, RESTBase-API, Traffic, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm)
[01:38:21] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:25] <wikibugs>	 (PS3) Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786
[01:41:34] <mutante>	 !log gitlab2001 - starting backup-restore service that had failed on previous automatic run
[01:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:51] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:19] <icinga-wm>	 RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:54:19] <wikibugs>	 (CR) Dzahn: [C: +2] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[02:19:21] <icinga-wm>	 PROBLEM - Check systemd state on backup2003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:19] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[02:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:38:13] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:43:15] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:43:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:43:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:43:52] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:43:55] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:09] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:11] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:12] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:12] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:13] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:17] <icinga-wm>	 RECOVERY - Check systemd state on backup2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:17] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:19] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:21] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:23] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:24] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:27] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:31] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:32] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:44:37] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:13] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:21] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:22] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:23] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:35] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:41] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:45:55] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:46:05] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:46:09] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:46:10] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:48:27] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:48:37] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:49:01] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:50:15] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:51:11] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:51:12] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:51:19] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:52:25] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:52:31] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:03] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:04] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:07] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:07] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:09] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:19] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:29] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:33] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:34] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:37] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:39] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:39] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:43] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:47] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:47] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:53:53] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:25] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:27] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:33] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:35] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:36] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:43] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:51] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1032 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:54:53] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:03] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:03] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:07] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:15] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:23] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:24] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:27] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:27] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:28] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:29] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1027 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:29] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:30] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:41] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1019 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:46] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:51] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:55:51] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:56:45] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:57:37] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:59:21] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:22:05] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:49] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:24:13] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:05] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:39:21] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:44:09] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:46:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:48:37] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1021 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:02:07] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:07:41] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:11:50] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott needs a canary! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:19:59] <icinga-wm>	 RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:59] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1021 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:21:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:01:47] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:49] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:13:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2146 T301879', diff saved to https://phabricator.wikimedia.org/P27778 and previous config saved to /var/cache/conftool/dbconfig/20220511-051307-marostegui.json
[05:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:13] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[05:13:59] <wikibugs>	 (PS1) Marostegui: db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790796 (https://phabricator.wikimedia.org/T301879)
[05:14:45] <wikibugs>	 (CR) Marostegui: [C: +2] db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790796 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[05:17:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2146 T301879', diff saved to https://phabricator.wikimedia.org/P27779 and previous config saved to /var/cache/conftool/dbconfig/20220511-051703-marostegui.json
[05:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:22] <wikibugs>	 (PS1) Marostegui: Revert "db2146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790440
[05:25:39] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790440 (owner: Marostegui)
[05:27:01] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[05:34:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27780 and previous config saved to /var/cache/conftool/dbconfig/20220511-053418-marostegui.json
[05:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:24] <stashbot>	 T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546
[05:35:12] <wikibugs>	 (CR) Marostegui: [C: -1] "Let's do it the other way around, let's drop it from the production databases before removing it from this file. I am setting it to -1 unt" [puppet] - https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: Zabe)
[05:49:18] <wikibugs>	 SRE, conftool: Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Ladsgroup)
[05:49:29] <wikibugs>	 SRE, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Ladsgroup)
[05:50:20] <wikibugs>	 SRE-OnFire, DBA, Blocked-on-schema-change, Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (Marostegui) I don't think we should attempt this schema change again. There's not much b...
[06:02:41] <wikibugs>	 SRE, SRE-OnFire, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Marostegui)
[06:14:52] <wikibugs>	 (PS1) Marostegui: db2146: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/790963 (https://phabricator.wikimedia.org/T308099)
[06:16:16] <wikibugs>	 (CR) Marostegui: [C: +2] db2146: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/790963 (https://phabricator.wikimedia.org/T308099) (owner: Marostegui)
[06:31:55] <Amir1>	 !log mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --verbose --mediatype=AUDIO --mime audio/webm (T226311)
[06:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:01] <stashbot>	 T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes - https://phabricator.wikimedia.org/T226311
[06:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:40:20] <marostegui>	 !log db2146 set global innodb_max_dirty_pages_pct = 75; T307082
[06:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:25] <stashbot>	 T307082: Investigate spikes on db1132 (mariadb 10.6 host) - https://phabricator.wikimedia.org/T307082
[06:42:52] <wikibugs>	 SRE, RESTBase-API, Traffic, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) > Most of that is controlled by the SRE team at a level in front of the REST API, since the frontend caching layer is a shared resource across everything....
[06:53:00] <wikibugs>	 (PS1) Slyngshede: Move Wiki Rsync fetch jobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673)
[06:53:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] Move Wiki Rsync fetch jobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:57:29] <icinga-wm>	 PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:05:18] <moritzm>	 !log updating ganeti4* to Ganeti 3.0.1-1~bpo10+1 T307997
[07:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:24] <stashbot>	 T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[07:14:49] <wikibugs>	 (CR) MVernon: [C: +2] swift: drain ms-be1059, skip cluster-OK checks [puppet] - https://gerrit.wikimedia.org/r/790694 (https://phabricator.wikimedia.org/T307667) (owner: MVernon)
[07:17:15] <wikibugs>	 (PS7) Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673)
[07:18:21] <moritzm>	 !log drain ganeti4001 T307997
[07:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:26] <stashbot>	 T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[07:22:29] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4004.ulsfo.wmnet with OS bullseye
[07:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:34] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye
[07:32:14] <wikibugs>	 (PS2) Slyngshede: Move Wiki Rsync fetch jobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673)
[07:32:57] <wikibugs>	 SRE, SRE-OnFire, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Joe) a:Joe
[07:42:52] <wikibugs>	 (PS1) Ayounsi: wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452)
[07:44:26] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: host reimage
[07:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:37] <wikibugs>	 (PS1) Elukey: Set celery 5 settings for ores2009 [puppet] - https://gerrit.wikimedia.org/r/790978 (https://phabricator.wikimedia.org/T303801)
[07:46:03] <icinga-wm>	 PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-06-10 07:44:58 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[07:46:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2054.codfw.wmnet with OS bullseye
[07:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:23] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2054.codfw.wmnet with OS bullseye
[07:46:30] <wikibugs>	 (CR) Jaime Nuche: [C: +1] deployment_server: Add keyholder identity for scap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: RLazarus)
[07:46:53] <wikibugs>	 (CR) Elukey: [C: +2] Set celery 5 settings for ores2009 [puppet] - https://gerrit.wikimedia.org/r/790978 (https://phabricator.wikimedia.org/T303801) (owner: Elukey)
[07:46:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:47:25] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4004.ulsfo.wmnet with reason: host reimage
[07:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores2009.codfw.wmnet with OS buster
[07:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:48] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: Jbond)
[07:55:03] <wikibugs>	 (PS1) Giuseppe Lavagetto: Raise an error if wrong tags are used in a query. [software/conftool] - https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100)
[07:57:35] <icinga-wm>	 RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:59:13] <wikibugs>	 SRE, Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (MoritzMuehlenhoff)
[07:59:22] <wikibugs>	 SRE, Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (MoritzMuehlenhoff) p:Triage→Medium
[07:59:45] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Double the number of replicas of eventgate-analytics-external [deployment-charts] - https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: Btullis)
[07:59:56] <wikibugs>	 SRE, SRE-swift-storage, Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 (Yann) Open→Resolved a:Yann The issue may come from my Internet connection. I will repost after further tests.
[08:00:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2054.codfw.wmnet with reason: host reimage
[08:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:57] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4004.ulsfo.wmnet with OS bullseye
[08:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:03] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye completed: - ganeti4004 (**PASS**)   - Down...
[08:04:14] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2054.codfw.wmnet with reason: host reimage
[08:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:06] <wikibugs>	 (PS1) AikoChou: ml-services: update articlequality image [deployment-charts] - https://gerrit.wikimedia.org/r/790983 (https://phabricator.wikimedia.org/T301766)
[08:06:15] <wikibugs>	 SRE, SRE-swift-storage, Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 (Aklapper) Resolved→Invalid
[08:11:37] <wikibugs>	 (CR) Awight: "I'm not sure why a wmf.11 branch was made, probably just an automatic process.  We can ignore because it will never be deployed: https://w" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.11) - https://gerrit.wikimedia.org/r/790437 (https://phabricator.wikimedia.org/T307972) (owner: Thiemo Kreuz (WMDE))
[08:12:57] <marostegui>	 !log Rename revision_actor_temp on db1132 (s1) and db1114 (s8) T307906
[08:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:01] <stashbot>	 T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[08:15:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2009.codfw.wmnet with reason: host reimage
[08:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:08] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2054.codfw.wmnet with OS bullseye
[08:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:12] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2054.codfw.wmnet with OS bullseye completed: - ms-be2054 (**PASS**)   - Downtim...
[08:19:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2009.codfw.wmnet with reason: host reimage
[08:19:17] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:12] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: update articlequality image [deployment-charts] - https://gerrit.wikimedia.org/r/790983 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[08:20:16] <wikibugs>	 (CR) Vgutierrez: requestctl: add AND NOT and OR NOT to the parsing grammar (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: Giuseppe Lavagetto)
[08:25:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM (didn't test it though)" [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:27:43] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/790773 (owner: Dwisehaupt)
[08:40:23] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:46:21] <moritzm>	 !log logging an example as part of Simon's onboarding
[08:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:22] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM, though we should probably move this to a cookbook sooner than later." [puppet] - https://gerrit.wikimedia.org/r/790735 (owner: Majavah)
[08:49:01] <wikibugs>	 (CR) David Caro: [C: +1] rake_modules: add check for spdk licence header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond)
[08:50:00] <wikibugs>	 (CR) Slyngshede: [C: +2] OpenLDAP, move restart cronjob to systemd timer. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:50:43] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host ores2009.codfw.wmnet with OS buster
[08:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:13] <wikibugs>	 (CR) Btullis: [C: +2] Double the number of replicas of eventgate-analytics-external [deployment-charts] - https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: Btullis)
[08:58:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet
[08:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet
[09:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:27] <wikibugs>	 (Merged) jenkins-bot: Double the number of replicas of eventgate-analytics-external [deployment-charts] - https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: Btullis)
[09:04:37] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[09:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:22] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[09:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:06] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[09:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:52] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[09:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[09:07:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:20] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have now deployed the change to double the number of replica pods for eventgate-an...
[09:12:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[09:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:10] <jayme>	 jouncebot: next
[09:14:10] <jouncebot>	 In 3 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300)
[09:15:37] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[09:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:23] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet
[09:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:11] <wikibugs>	 (CR) Kosta Harlan: Account creation: update live campaigns config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (owner: Sergio Gimeno)
[09:21:29] <wikibugs>	 (PS1) Ayounsi: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452)
[09:22:01] <wikibugs>	 (PS2) Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443)
[09:22:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Two typos inline, looks great!" [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond)
[09:23:18] <wikibugs>	 (CR) Slyngshede: [C: +2] Convert dumps-status from cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[09:23:47] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1)
[09:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2055.codfw.wmnet with OS bullseye
[09:24:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:49] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2055.codfw.wmnet with OS bullseye
[09:24:59] <wikibugs>	 (PS1) Jcrespo: BackupStatistics: Increase maximum backup time to a week [software/wmfbackups] - https://gerrit.wikimedia.org/r/790992
[09:25:34] <wikibugs>	 (PS4) Samtar: changeprop: Remove RESTBase page blacklist [deployment-charts] - https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359)
[09:25:44] <wikibugs>	 (CR) Jcrespo: [C: +2] BackupStatistics: Increase maximum backup time to a week [software/wmfbackups] - https://gerrit.wikimedia.org/r/790992 (owner: Jcrespo)
[09:27:06] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2001.codfw.wmnet
[09:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:40] <wikibugs>	 (CR) Btullis: [C: +2] "I will deploy this change today. Thanks for the guidance jcrespo." [puppet] - https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[09:27:54] <jayme>	 !log systemctl reset-failed ifup@ens5.service on registry2003 - T273026
[09:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:58] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[09:30:51] <TheresNoTime>	 hnowlan: reckon we might be able to get https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/767878 deployed today?
[09:33:03] <wikibugs>	 (CR) Jcrespo: [C: +1] dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[09:34:27] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for registry2003.codfw.wmnet
[09:34:27] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for registry2003.codfw.wmnet
[09:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:14] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2002.codfw.wmnet
[09:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:19] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[09:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:02] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[09:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:15] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7916593, @cmooney wrote: >> If there is any kind of anycast with the k8s prefixes (same prefix adverti...
[09:41:37] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2002.codfw.wmnet
[09:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:01] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add 'includes' in private address reverse zones for new subnets [dns] - https://gerrit.wikimedia.org/r/790744 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[09:42:22] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:42:59] <godog>	 jouncebot: next
[09:42:59] <jouncebot>	 In 3 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300)
[09:43:02] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[09:43:55] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[09:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:18] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[09:44:26] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:22] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[09:49:01] <wikibugs>	 (PS1) Btullis: Add grants for new databases to be backed up on analytics-meta [puppet] - https://gerrit.wikimedia.org/r/790994 (https://phabricator.wikimedia.org/T308113)
[09:50:21] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[09:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:24] <wikibugs>	 (PS1) Ladsgroup: Set dewiki to read new for templatelinks [mediawiki-config] - https://gerrit.wikimedia.org/r/790997 (https://phabricator.wikimedia.org/T306673)
[12:35:29] <wikibugs>	 (CR) Marostegui: [C: -1] filtered_tables: remove flaggedpage_config.fpc_select (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: Zabe)
[12:36:29] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[12:42:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27786 and previous config saved to /var/cache/conftool/dbconfig/20220511-124226-marostegui.json
[12:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:32] <stashbot>	 T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546
[12:43:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:44:13] <wikibugs>	 (CR) Yaron Koren: [C: +1] "Thanks! As you'd expect, this looks good to me." [puppet] - https://gerrit.wikimedia.org/r/790998 (owner: Hashar)
[12:45:32] <wikibugs>	 (CR) Tacsipacsi: filtered_tables: remove flaggedpage_config.fpc_select (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: Zabe)
[12:45:34] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye
[12:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:39] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye
[12:47:01] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) > Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from...
[12:50:22] <wikibugs>	 (PS3) Slyngshede: Move dumps exception checker to systemd timer [puppet] - https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673)
[12:50:38] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2007.codfw.wmnet with reason: host reimage
[12:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:28] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2007.codfw.wmnet with reason: host reimage
[12:54:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:22] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have stumbled upon this issue with HAProxy, which seems to fit some of the symptom...
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1300).
[13:00:04] <jouncebot>	 awight: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <awight>	 I can self-deploy this one.
[13:00:19] <Lucas_WMDE>	 ok!
[13:00:31] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "Deploying." [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790436 (https://phabricator.wikimedia.org/T307972) (owner: 10Thiemo Kreuz (WMDE))
[13:01:03] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.17, 102.28, 94.08 https://wikitech.wikimedia.org/wiki/Swift
[13:03:51] <wikibugs>	 (03Merged) 10jenkins-bot: Fix incomplete FlaggedRevs::binaryFlagging() implementation [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790436 (https://phabricator.wikimedia.org/T307972) (owner: 10Thiemo Kreuz (WMDE))
[13:04:32] <wikibugs>	 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui)
[13:05:03] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10EAkinloose) I am good to go. Thanks @RLazarus !  {F35130227}
[13:07:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:07:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:27] <wikibugs>	 (03PS4) 10Kosta Harlan: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno)
[13:08:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:08:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:20] <wikibugs>	 (03PS1) 10Jelto: gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463)
[13:11:28] <logmsgbot>	 !log awight@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:790436|Fix incomplete FlaggedRevs::binaryFlagging() implementation (T307972)]] (duration: 00m 51s)
[13:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:33] <stashbot>	 T307972: Admins & Reviewers can't review revision - https://phabricator.wikimedia.org/T307972
[13:12:08] <wikibugs>	 (03PS2) 10Jelto: gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463)
[13:13:00] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:13:18] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35193/console" [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[13:13:20] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4001.ulsfo.wmnet with OS bullseye
[13:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**)   - Removed from...
[13:14:55] <awight>	 !log EU backports complete
[13:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:39] <icinga-wm>	 PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:17:41] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[13:21:23] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.18, 100.83, 99.23 https://wikitech.wikimedia.org/wiki/Swift
[13:21:24] <wikibugs>	 (03PS1) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot a grid queue [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030
[13:23:05] <wikibugs>	 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) Just pasting this for the record here with one other test I have done on db2...
[13:24:08] <wikibugs>	 (03PS2) 10Majavah: wmcs: toolforge: grid: add a cookbook to reboot a grid queue [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030
[13:25:37] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2007.codfw.wmnet with OS buster
[13:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:09] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::prometheus: increase openstack-exporter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178)
[13:28:03] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:wmcs::prometheus: increase openstack-exporter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah)
[13:28:15] <wikibugs>	 (03PS1) 10Klausman: hiera: Use celery v5 on ores2008 [puppet] - 10https://gerrit.wikimedia.org/r/791034 (https://phabricator.wikimedia.org/T303801)
[13:28:23] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[13:28:41] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35194/console" [puppet] - 10https://gerrit.wikimedia.org/r/791033 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah)
[13:29:50] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores2008 [puppet] - 10https://gerrit.wikimedia.org/r/791034 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman)
[13:30:17] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2008.codfw.wmnet with OS buster
[13:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:29] <wikibugs>	 (03CR) 10Majavah: "not sure if this is needed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/791033/?" [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez)
[13:33:30] <wikibugs>	 (03PS20) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790
[13:33:42] <wikibugs>	 (03CR) 10Ottomata: "Thank you Ben!" [puppet] - 10https://gerrit.wikimedia.org/r/697992 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata)
[13:34:23] <wikibugs>	 (03CR) 10Ottomata: Enable basic monitoring of the airflow services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[13:34:33] <wikibugs>	 (03CR) 10Ottomata: "One nit, +1 otherwise (or either way)" [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[13:34:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond)
[13:34:42] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[13:36:51] <wikibugs>	 (03PS2) 10Btullis: Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102)
[13:37:35] <wikibugs>	 (03CR) 10Jbond: rake: Add new rake task to convert a module to SPDX (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond)
[13:40:28] <wikibugs>	 (03CR) 10Muehlenhoff: rake: Add new rake task to convert a module to SPDX (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond)
[13:41:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:41:34] <wikibugs>	 (03CR) 10Btullis: Enable basic monitoring of the airflow services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[13:41:42] <wikibugs>	 (03PS21) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790
[13:42:21] <icinga-wm>	 RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:42:38] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Move Wiki Rsync fetch jobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[13:44:59] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Vgutierrez) Beginning with HAProxy 2.1 HTX is the only way to go. On another issue (https://g...
[13:45:31] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:54:31] <wikibugs>	 (03PS1) 10Slyngshede: Move tilerator regeneration from crontab to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673)
[13:54:54] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye
[13:54:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye
[13:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move tilerator regeneration from crontab to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[13:55:33] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2008.codfw.wmnet with reason: host reimage
[13:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:18] <wikibugs>	 (03PS3) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705
[13:58:37] <wikibugs>	 (03PS2) 10Slyngshede: Move tilerator regeneration from crontab to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673)
[13:58:57] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2008.codfw.wmnet with reason: host reimage
[13:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:05] <wikibugs>	 (03PS4) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705
[13:59:32] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35196/console" [puppet] - 10https://gerrit.wikimedia.org/r/791035 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[14:05:17] <wikibugs>	 (03PS1) 10Elukey: Add Aiko and Kevin to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927)
[14:07:01] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Raise an error if wrong tags are used in a query. [software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) (owner: 10Giuseppe Lavagetto)
[14:08:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:40] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "In an ideal world, we would be able to separate/isolate secrets for different deployment destinations (i.e. ML k8s vs others) from each ot" [puppet] - 10https://gerrit.wikimedia.org/r/791036 (https://phabricator.wikimedia.org/T307927) (owner: 10Elukey)
[14:12:53] <wikibugs>	 (03PS4) 10Slyngshede: Move dumps exception checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673)
[14:15:25] <wikibugs>	 (03Abandoned) 10Slyngshede: Move dumps exception checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791015 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[14:22:40] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4001.ulsfo.wmnet with OS bullseye
[14:22:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**)   - Removed from...
[14:25:54] <wikibugs>	 (03PS5) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705
[14:29:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks sane to me, I didn't test it though." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond)
[14:31:25] <wikibugs>	 (03PS6) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705
[14:32:44] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2008.codfw.wmnet with OS buster
[14:32:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:48] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (10Papaul) @Volans the only reason i see is the size of the disks and number of disks. We are using software RAID on 8x ~2TB disks
[14:36:46] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:40:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:42:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:45:18] <wikibugs>	 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Tsevener) Hi folks - automatic updates are going out at 100% now. Hopefully the load is looking okay on your end. Thanks!
[14:48:47] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "This is manually deployed on netbox-dev2002, setting to -1 until we have rebuilt the netbox frontends" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond)
[14:51:35] <vgutierrez>	 !log depool ats-be on cp4032
[14:51:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:02] <wikibugs>	 (03PS7) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705
[14:58:14] <moritzm>	 !log installing qemu security updates on bullseye
[14:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:19] <vgutierrez>	 !log pool ats-be on cp4032
[15:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27789 and previous config saved to /var/cache/conftool/dbconfig/20220511-150038-marostegui.json
[15:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:44] <stashbot>	 T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546
[15:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:04:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) > Even in the legacy setup (pre row e/f) adding new nodes requires manual error-prone gerrit changes like this one 35b0...
[15:13:23] <wikibugs>	 (03CR) 10CDanis: "Overall LGTM, just one thought" [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto)
[15:14:42] <wikibugs>	 (03PS21) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270)
[15:15:43] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) (owner: 10Giuseppe Lavagetto)
[15:17:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond)
[15:18:14] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: add retry-after request header when applicable [software/conftool] - 10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto)
[15:18:24] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "Yay!" [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar)
[15:25:08] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@378e7ca]: (no justification provided)
[15:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:17] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@378e7ca]: (no justification provided) (duration: 00m 08s)
[15:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:59] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[15:45:55] <wikibugs>	 (03CR) 10Volans: "Have you considered using SREBatchBase/SREBatchRunnerBase instead? If they don't work for this specific use case John and I would like to " [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[15:46:28] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@378e7ca]: (no justification provided)
[15:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:32] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@378e7ca]: (no justification provided) (duration: 00m 03s)
[15:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:49:45] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host ganeti4001.mgmt.ulsfo.wmnet with reboot policy FORCED
[15:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:03] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4001.mgmt.ulsfo.wmnet with reboot policy FORCED
[15:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:25] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable basic monitoring of the airflow services [puppet] - 10https://gerrit.wikimedia.org/r/791020 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis)
[15:53:39] <wikibugs>	 (03CR) 10Volans: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[15:58:08] <wikibugs>	 (03PS1) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797)
[16:00:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Update copyrights. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/790645 (https://phabricator.wikimedia.org/T307398) (owner: 10Roman Stolar)
[16:02:42] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking)
[16:02:46] <wikibugs>	 (03Merged) 10jenkins-bot: Update copyrights. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/790645 (https://phabricator.wikimedia.org/T307398) (owner: 10Roman Stolar)
[16:09:42] <wikibugs>	 10SRE, 10conftool, 10Patch-For-Review: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (10Joe) a:03Joe
[16:10:53] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2022-05-11-135122-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/791052 (https://phabricator.wikimedia.org/T307967)
[16:12:10] <wikibugs>	 (03CR) 10Hashar: planet: add between the brackets podcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar)
[16:12:28] <wikibugs>	 (03PS2) 10Hashar: planet: add between the brackets podcast [puppet] - 10https://gerrit.wikimedia.org/r/790998
[16:23:45] <wikibugs>	 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10akosiaris) Thanks for the update. Load has somewhat increased on our side, albeit minimally.
[16:28:53] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:29:22] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607)
[16:29:24] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638)
[16:29:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: requestctl: add retry-after request header when applicable [software/conftool] - 10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824)
[16:31:33] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:31:35] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:34:15] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:34:49] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar)
[16:39:07] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto)
[16:43:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:56:20] <wikibugs>	 (03PS2) 10DLynch: Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410)
[16:56:25] <wikibugs>	 (03PS1) 10Clare Ming: Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952)
[16:57:17] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 103.50, 101.05, 98.16 https://wikitech.wikimedia.org/wiki/Swift
[17:00:48] <wikibugs>	 (03PS1) 10Stang: commonswiki: Add *.toolforge.org to wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167)
[17:10:16] <wikibugs>	 (03PS1) 10Btullis: Update the LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/791061 (https://phabricator.wikimedia.org/T301462)
[17:12:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks all, also for taking care it's https. I've been trying to replace all the http links where possible" [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar)
[17:13:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming)
[17:14:13] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "I'd still prefer if we added individual tools instead of all of toolforge / cloud vps (*.wmcloud.org) based on need, like we do for any ot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) (owner: 10Stang)
[17:14:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "i'll just leave it at 1hr intervals, no need to worry about it I think." [puppet] - 10https://gerrit.wikimedia.org/r/790998 (owner: 10Hashar)
[17:15:27] <wikibugs>	 (03PS4) 10Dzahn: create a separate variant for 15.wikipedia.org site [container/miscweb] - 10https://gerrit.wikimedia.org/r/790786
[17:16:27] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:18:49] <wikibugs>	 (03CR) 10Clare Ming: "recheck" [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: 10Clare Ming)
[17:18:55] <wikibugs>	 (03CR) 10Brennen Bearnes: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes)
[17:18:59] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Turn on monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/790773 (owner: 10Dwisehaupt)
[17:19:03] <wikibugs>	 (03CR) 10Stang: commonswiki: Add *.toolforge.org to wgCopyUploadsDomains allowlist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791059 (https://phabricator.wikimedia.org/T78167) (owner: 10Stang)
[17:27:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gitlab: fix regex for restore version check [puppet] - 10https://gerrit.wikimedia.org/r/791029 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[17:30:41] <hashar>	 mutante: thanks :]  I still have to listen to those "between the brackets" podcasts, hopefully having them in planet will cause me to start listening to them
[17:32:41] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:33:23] <mutante>	 hashar: :) glad to hear planet can make a difference
[17:36:05] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (23) node(s) change every puppet run: contint1001, contint2001, cuminunpriv1001, ms-be1040, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, puppetmaster1001, puppetmaster2001, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2
[17:36:05] <icinga-wm>	 nos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[17:38:39] <mutante>	 contint* doing changes on every run would be new
[17:39:17] <mutante>	 one of the issues here is there are always some hosts doing that but it stays slightly under the alerting threshold
[17:40:43] <mutante>	 ah, it's what I already reported as https://phabricator.wikimedia.org/T307740
[17:41:47] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (Dzahn) This contributes to:   ` 17:36 <+icinga-wm> PROBLEM - Ensure hosts are not performing a change on every pupp...
[17:43:11] <wikibugs>	 (CR) Sergio Gimeno: [C: -1] "Missing to add the correct messageKey for social-latam-2022-A campaign" [mediawiki-config] - https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: Sergio Gimeno)
[17:44:33] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[17:45:06] <wikibugs>	 (CR) Dzahn: [V: +2] create a separate variant for 15.wikipedia.org site [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[17:46:51] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[17:50:52] <wikibugs>	 (PS2) Zabe: swift: migrate container stats cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673)
[17:51:04] <wikibugs>	 (CR) Zabe: swift: migrate container stats cron to systemd timer job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:52:52] <wikibugs>	 (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/35197/" [puppet] - https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:53:31] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] "recheck" [container/miscweb] - https://gerrit.wikimedia.org/r/790786 (owner: Dzahn)
[18:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T1800)
[18:02:36] <wikibugs>	 (CR) Razzi: [C: +2] Use both dbproxy101[89] servers for both wikireplica services [puppet] - https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: Btullis)
[18:06:38] <razzi>	 !log razzi@lvs1020:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
[18:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:15:37] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (Cmjohnson) HPE came back and asked me to reseat the Raid controller battery, this did not fix the issue, the NIC card is still flashing amber and during post I notice that the process...
[18:17:04] <razzi>	 The pybal error above happened when I stopped pybal to apply the change
[18:22:16] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:24:57] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 107.18, 100.14, 95.79 https://wikitech.wikimedia.org/wiki/Swift
[18:33:52] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:34:02] <wikibugs>	 (CR) Herron: [C: +1] Rewrite logster::job to use systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[18:36:15] <wikibugs>	 SRE-swift-storage, Data-Engineering, Data-Persistence, Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Milimetric) @Htriedman: I know you're talking to @EChetty about this, we're triaging it to this column which is like a task "inc...
[18:36:29] <wikibugs>	 SRE, ops-eqiad, serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Cmjohnson) @Dzahn The server is dead, it will not power on, I attempted to get to basic start-up, 1 DIMM, 1 CPU, and still will not power on,  Historically a main board swap is requi...
[18:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:45:09] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH) Can we resurrect this and I finish out the esams hosts?  I'd like to close this out, its just shaming me with its age.  Checking first since the traffic te...
[19:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:04:52] <wikibugs>	 (CR) RLazarus: [V: +1 C: +2] deployment_server: Add keyholder identity for scap [puppet] - https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: RLazarus)
[19:04:55] <wikibugs>	 SRE-swift-storage, Data-Engineering, Data-Persistence, Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Htriedman) @Milimetric Thanks for the pointers on this process! I also just talked to @gmodena and think that we're starting to...
[19:14:20] <wikibugs>	 (CR) Bernard Wang: [C: +1] "LGTM!" [skins/Vector] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: Clare Ming)
[19:18:38] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:19:29] <rzl>	 !log Added new `scap` identity to keyholder on deploy[1002,2002] - T307351
[19:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:35] <stashbot>	 T307351: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351
[19:19:57] <wikibugs>	 SRE, ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (Cmjohnson) Open→Resolved a:Cmjohnson DIMM replaced and booted into the OS, I was able to update the firmware while it was offline.
[19:20:42] <wikibugs>	 SRE, SRE-Access-Requests, Scap, Patch-For-Review: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (RLazarus) Open→Resolved a:RLazarus This should be all set!  ` rzl@deploy1002:~$ sudo run-puppet-agent [...] rzl@deploy1002:~$ sudo keyholder add /...
[19:26:04] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T306129 (Cmjohnson) Open→Resolved This was wrongly entered, it's connected to cloudstore1011 that was in xe-4/0/23.  Updated netbox and ran homer
[19:30:52] <dancy>	 Thanks rzl!
[19:31:49] <rzl>	 dancy: of course! haven't done that before, let me know if I missed anything
[19:32:54] <wikibugs>	 (PS1) Ebernhardson: [Beta Cluster] LabsServices: Move eventgate to new hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862)
[19:36:32] <wikibugs>	 (PS6) Brennen Bearnes: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)
[19:42:16] <wikibugs>	 (PS7) Brennen Bearnes: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)
[19:42:25] <wikibugs>	 (PS1) Dzahn: drop the staging directory and contents [container/miscweb] - https://gerrit.wikimedia.org/r/791071
[19:42:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] drop the staging directory and contents [container/miscweb] - https://gerrit.wikimedia.org/r/791071 (owner: Dzahn)
[19:43:27] <wikibugs>	 (PS2) Dzahn: drop the staging directory and contents [container/miscweb] - https://gerrit.wikimedia.org/r/791071
[19:45:44] <wikibugs>	 (PS1) Dzahn: rename the production variant to bzstatic [container/miscweb] - https://gerrit.wikimedia.org/r/791072
[19:46:21] <wikibugs>	 (CR) Ottomata: [C: +2] "OOps.  Thank you." [mediawiki-config] - https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862) (owner: Ebernhardson)
[19:46:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:48:06] <wikibugs>	 (Merged) jenkins-bot: [Beta Cluster] LabsServices: Move eventgate to new hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/791070 (https://phabricator.wikimedia.org/T307862) (owner: Ebernhardson)
[19:53:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:54:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220511T2000).
[20:00:04] <wikibugs>	 SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (HMonroy) Thank you @RLazarus !!
[20:00:05] <jouncebot>	 kemayo and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <cjming>	 i can deploy - since i'm on the list anyway
[20:00:25] <Kemayo>	 👋
[20:00:53] <cjming>	 hi Kemayo: are you a self-server or would you like me to deploy?
[20:01:01] <Kemayo>	 cjming: I need you to deploy, alas.
[20:01:07] <cjming>	 alrighty
[20:01:24] <wikibugs>	 (CR) Clare Ming: [C: +2] Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) (owner: DLynch)
[20:02:09] <wikibugs>	 (Merged) jenkins-bot: Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) (owner: DLynch)
[20:03:12] <cjming>	 Kemayo: is your patch something that can be tested on mwdebug1001?
[20:03:31] <Kemayo>	 Should be testable there, yeah.
[20:03:45] <cjming>	 lmk and I will sync on your good word
[20:04:10] <Kemayo>	 cjming: Okay, it looks good.
[20:04:18] <cjming>	 great - syncing
[20:05:03] <wikibugs>	 (CR) Clare Ming: [C: +2] Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky [skins/Vector] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: Clare Ming)
[20:05:28] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790395|Release DiscussionTools new topic tool to former a/b test wikis (T307410)]] (duration: 00m 54s)
[20:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:35] <stashbot>	 T307410: [Config Change] Enable the New Topic Tool as opt-out at A/B test wikis - https://phabricator.wikimedia.org/T307410
[20:05:36] <cjming>	 Kemayo: your change should be live
[20:05:44] <Kemayo>	 cjming: Thanks for the help!
[20:05:49] <cjming>	 np!
[20:05:56] <cjming>	 doing my patch now
[20:07:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:08:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:29] <wikibugs>	 (Merged) jenkins-bot: Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC [skins/Vector] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790443 (https://phabricator.wikimedia.org/T307952) (owner: Clare Ming)
[20:25:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/skins/Vector/resources: Backport: [[gerrit:790443|Factor out a separate scroll observer for the TOC A/B test, which should be fired separately from the page title observer used by the sticky header and TOC (T307952 T307345)]] (duration: 00m 52s)
[20:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:57] <stashbot>	 T307345: Sticky header disappears within lead sections of certain articles when old table of contents scrolls into view - https://phabricator.wikimedia.org/T307345
[20:25:57] <stashbot>	 T307952: Vector isnt firing  'scroll-to-toc' and 'scroll-to-top' events correctly  - https://phabricator.wikimedia.org/T307952
[20:26:39] <cjming>	 nothing else in the queue so I'll go ahead and close this window
[20:26:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:09] <cjming>	 !log end of UTC late backport & config window
[20:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:19] <tgr>	 !log T304542 running mwscript extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php hiwiki --verbose
[20:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:24] <stashbot>	 T304542: Deploy "add a link" to third round of wikis - https://phabricator.wikimedia.org/T304542
[20:28:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:19] <wikibugs>	 (PS8) Brennen Bearnes: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)
[20:34:02] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 105.04, 101.18, 94.96 https://wikitech.wikimedia.org/wiki/Swift
[20:43:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:53:42] <wikibugs>	 (CR) Dzahn: [C: +2] drop the staging directory and contents [container/miscweb] - https://gerrit.wikimedia.org/r/791071 (owner: Dzahn)
[20:57:46] <wikibugs>	 (Merged) jenkins-bot: drop the staging directory and contents [container/miscweb] - https://gerrit.wikimedia.org/r/791071 (owner: Dzahn)
[21:01:04] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH) Synced with Brandon via IRC, and I'm good to resume this.  Each host, one per cluster at a time (one upload, one text), disabling puppet agent, depooling,...
[21:01:24] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH) Open→In progress
[21:01:27] <wikibugs>	 SRE, Traffic-Icebox: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (RobH)
[21:06:34] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 76.39, 72.59, 79.52 https://wikitech.wikimedia.org/wiki/Swift
[21:28:27] <wikibugs>	 (CR) Brennen Bearnes: "This has been tested on gitlab-prod-1001.devtools.wmcloud.org and, once a Security Group is created to allow ingress to port 5050, seems t" [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[21:29:08] <wikibugs>	 (CR) Ahmon Dancy: [V: +1 C: +1] "Tested" [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[21:33:23] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[21:39:20] <icinga-wm>	 PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100%
[21:41:45] <robh>	 my bad on that, put it into maint and fired updates at the same time
[21:41:51] <robh>	 should have done maint and waited 30 seconds
[21:42:25] <wikibugs>	 (PS1) Gergő Tisza: Temporarily disable link recommendation backend on hi, uk [mediawiki-config] - https://gerrit.wikimedia.org/r/791085 (https://phabricator.wikimedia.org/T308186)
[21:47:52] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[21:48:41] <wikibugs>	 (PS1) Gergő Tisza: Revert "Temporarily disable link recommendation backend on hi, uk" [mediawiki-config] - https://gerrit.wikimedia.org/r/791089 (https://phabricator.wikimedia.org/T308186)
[21:49:52] <icinga-wm>	 RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 81.04 ms
[21:50:24] <icinga-wm>	 ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott I will investigate this if I get a moment before bedtime https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:01:24] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[22:19:30] <wikibugs>	 (PS1) Dzahn: define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171)
[22:28:32] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[22:33:14] <icinga-wm>	 PROBLEM - Host cp3058 is DOWN: PING CRITICAL - Packet loss = 100%
[22:33:58] <icinga-wm>	 PROBLEM - Host cp3059 is DOWN: PING CRITICAL - Packet loss = 100%
[22:36:43] <robh>	 i put them in to maint... whyyy echoing down....
[22:37:56] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:43:36] <icinga-wm>	 RECOVERY - Host cp3059 is UP: PING OK - Packet loss = 0%, RTA = 81.05 ms
[22:44:00] <icinga-wm>	 RECOVERY - Host cp3058 is UP: PING OK - Packet loss = 0%, RTA = 81.02 ms
[22:48:12] <icinga-wm>	 PROBLEM - purged service on cp3058 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:50:00] <icinga-wm>	 PROBLEM - purged service on cp3059 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:50:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:52:30] <icinga-wm>	 PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_purged.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:54:58] <wikibugs>	 (PS1) Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171)
[22:56:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[23:00:41] <wikibugs>	 (PS2) Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171)
[23:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:03:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[23:06:10] <icinga-wm>	 RECOVERY - purged service on cp3059 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:06:38] <icinga-wm>	 RECOVERY - purged service on cp3058 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:07:07] <wikibugs>	 (PS3) Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171)
[23:15:35] <wikibugs>	 SRE, ops-esams, DC-Ops, Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (RobH)
[23:23:47] <wikibugs>	 (CR) Dzahn: [C: +2] rename the production variant to bzstatic [container/miscweb] - https://gerrit.wikimedia.org/r/791072 (owner: Dzahn)
[23:23:51] <wikibugs>	 (PS2) Dzahn: rename the production variant to bzstatic [container/miscweb] - https://gerrit.wikimedia.org/r/791072
[23:32:24] <icinga-wm>	 PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:32:50] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:34:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:37:06] <icinga-wm>	 RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:37:30] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:38:52] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:46:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale