[00:00:04] <jouncebot>	 twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0000).
[00:02:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[00:02:03] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:05] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[00:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:43] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage
[00:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:29] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[00:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:06] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[00:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:43] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage
[00:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:46] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:20:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:26:08] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:27:01] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS bullseye
[00:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:44] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bullseye
[00:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:10] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:33:41] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS bullseye
[00:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:28] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:38:38] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1044 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:51:48] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:08:31] <wikibugs>	 SRE, MediaWiki-Stakeholders-Group, Traffic, Wikipedia-Android-App-Backlog, Performance-Team (Radar): RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (Renoirb) This has been closed? Has an equivalent idea started under a different name?
[01:18:10] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:34:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye
[01:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:34:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:36:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:44:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:44:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:37] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye
[01:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:55] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:22] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T128546)
[02:10:53] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:11:26] <wikibugs>	 (PS2) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012)
[02:13:05] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:05] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:14:07] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:10:05] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:49:16] <wikibugs>	 (PS4) NguoiDungKhongDinhDanh: Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3 [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579)
[05:14:29] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:09:50] <wikibugs>	 (PS1) Razzi: karapace: remove Type=notify [puppet] - https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565)
[06:16:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:16:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:16:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on 12 hosts with reason: Maintenance
[06:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on 12 hosts with reason: Maintenance
[06:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:21] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:37:37] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 129, down: 6, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:48:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23012 and previous config saved to /var/cache/conftool/dbconfig/20220324-064823-root.json
[06:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:59] <icinga-wm>	 RECOVERY - puppet last run on ml-serve1001 is OK: OK: Puppet is currently disabled (elukey - cni testing), not alerting. Last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:37] <wikibugs>	 (PS1) Elukey: install_server: update netboot settings for kubernetes nodes on Stretch [puppet] - https://gerrit.wikimedia.org/r/773389 (https://phabricator.wikimedia.org/T300744)
[06:58:39] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes1012 [puppet] - https://gerrit.wikimedia.org/r/773390 (https://phabricator.wikimedia.org/T300744)
[06:59:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for testing', diff saved to https://phabricator.wikimedia.org/P23013 and previous config saved to /var/cache/conftool/dbconfig/20220324-065940-marostegui.json
[06:59:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Amir1 and apergos: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0700)
[07:00:17] <wikibugs>	 (CR) Elukey: [C: +2] install_server: update netboot settings for kubernetes nodes on Stretch [puppet] - https://gerrit.wikimedia.org/r/773389 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:00:24] <apergos>	 There are no trainees signed up and no patches scheduled in the window
[07:00:38] <apergos>	 maybe just as well since for some of us this is happening at 9 am :-D
[07:00:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:02:03] <elukey>	 checking --^
[07:03:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23014 and previous config saved to /var/cache/conftool/dbconfig/20220324-070327-root.json
[07:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After testing', diff saved to https://phabricator.wikimedia.org/P23015 and previous config saved to /var/cache/conftool/dbconfig/20220324-070513-root.json
[07:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:12] <wikibugs>	 (PS1) Marostegui: db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773391
[07:07:53] <wikibugs>	 (CR) Marostegui: [C: +2] db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773391 (owner: Marostegui)
[07:08:14] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs settings for kubernetes1012 [puppet] - https://gerrit.wikimedia.org/r/773390 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:08:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1012.eqiad.wmnet with OS bullseye
[07:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:37] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:17:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:18:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23016 and previous config saved to /var/cache/conftool/dbconfig/20220324-071832-root.json
[07:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After testing', diff saved to https://phabricator.wikimedia.org/P23017 and previous config saved to /var/cache/conftool/dbconfig/20220324-072017-root.json
[07:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:23] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage
[07:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage
[07:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:52] <wikibugs>	 (CR) Tchanders: [C: +1] Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran)
[07:33:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23018 and previous config saved to /var/cache/conftool/dbconfig/20220324-073337-root.json
[07:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After testing', diff saved to https://phabricator.wikimedia.org/P23019 and previous config saved to /var/cache/conftool/dbconfig/20220324-073520-root.json
[07:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:11] <wikibugs>	 (CR) Tchanders: [C: +1] Add IPInfo to BetaFeatures (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran)
[07:39:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1012.eqiad.wmnet with OS bullseye
[07:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:48] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:42:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:44:59] <wikibugs>	 (PS4) Majavah: kubeadm::helm: use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/773274
[07:45:01] <wikibugs>	 (PS4) Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532)
[07:45:03] <wikibugs>	 (PS1) Majavah: kubeadm::helm: remove absented file [puppet] - https://gerrit.wikimedia.org/r/773438
[07:45:30] <icinga-wm>	 PROBLEM - Check systemd state on netflow6001 is CRITICAL: CRITICAL - degraded: The following units failed: sfacctd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23020 and previous config saved to /var/cache/conftool/dbconfig/20220324-074841-root.json
[07:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After testing', diff saved to https://phabricator.wikimedia.org/P23021 and previous config saved to /var/cache/conftool/dbconfig/20220324-075024-root.json
[07:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:45] <icinga-wm>	 RECOVERY - Check systemd state on netflow6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:42] <wikibugs>	 (PS1) Marostegui: switchover-tmpl.sh: Add "Affected wikis" field [software] - https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605)
[07:58:27] <wikibugs>	 (PS1) Majavah: Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441
[08:00:05] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: Time to snap out of that daydream and deploy 🚂🧪Trainsperiment Week Deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0800).
[08:00:41] <wikibugs>	 (PS1) Jcrespo: mediabackup: Update s4 backup in eqiad [puppet] - https://gerrit.wikimedia.org/r/773442 (https://phabricator.wikimedia.org/T299764)
[08:03:44] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1013 [puppet] - https://gerrit.wikimedia.org/r/773443 (https://phabricator.wikimedia.org/T300744)
[08:05:05] <wikibugs>	 (PS1) Jcrespo: Add new command line utility to update existing metadata [software/mediabackups] - https://gerrit.wikimedia.org/r/773444 (https://phabricator.wikimedia.org/T299764)
[08:05:10] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Observability-Alerting: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (fgiunchedi) Open→Declined I think nowadays an host with no role will cause puppet to fail and therefore the reimage cookbook to fail to...
[08:05:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After testing', diff saved to https://phabricator.wikimedia.org/P23022 and previous config saved to /var/cache/conftool/dbconfig/20220324-080528-root.json
[08:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:16] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:43] <marostegui>	 !log dbmaint s7@codfw T302658
[08:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:50] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[08:12:08] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1013 [puppet] - https://gerrit.wikimedia.org/r/773443 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:12:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1013.eqiad.wmnet with OS bullseye
[08:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:11] <wikibugs>	 (CR) Jcrespo: [C: +2] mediabackup: Update s4 backup in eqiad [puppet] - https://gerrit.wikimedia.org/r/773442 (https://phabricator.wikimedia.org/T299764) (owner: Jcrespo)
[08:21:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:23:52] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage
[08:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:27] <wikibugs>	 SRE, Traffic-Icebox: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (fgiunchedi) - observability since there's no action ATM, feel free to retag when needed
[08:29:32] <wikibugs>	 (PS1) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[08:30:43] <wikibugs>	 (CR) Giuseppe Lavagetto: Introduce requestctl (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[08:31:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage
[08:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:32:13] <wikibugs>	 (PS2) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[08:32:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:33:36] <wikibugs>	 (PS3) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[08:33:43] <wikibugs>	 (PS6) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
[08:35:56] <wikibugs>	 (PS4) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[08:36:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[08:36:27] <mmandere>	 !log depool cp1078 for reimage - T290005
[08:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:32] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[08:37:28] <wikibugs>	 (PS3) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240)
[08:37:28] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:38:15] <hashar>	 good morning
[08:38:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:39:20] <wikibugs>	 (Abandoned) Hashar: parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes)
[08:39:26] <wikibugs>	 SRE, Observability-Metrics: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (fgiunchedi) Open→Resolved a:fgiunchedi Fixed with Grafana 8 upgrade
[08:39:58] <hashar>	 jnuche: turns out wmf.4 got rolled back yesterday due to a parser issue ( https://gerrit.wikimedia.org/r/q/bug:T304564 )
[08:39:58] <stashbot>	 T304564: MWException: `[title]` is not a valid file title. - https://phabricator.wikimedia.org/T304564
[08:40:43] <hashar>	 I am wondering whether we should move wmf.4 forward today or just abandon it :D
[08:41:31] <hashar>	 I guess I will do the backport
[08:41:33] <hashar>	 revisit the log
[08:41:37] <hashar>	 and move wmf.4 forward again
[08:41:40] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp1078 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[08:42:04] <wikibugs>	 (PS5) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[08:42:13] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:43:06] <wikibugs>	 (PS1) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773450
[08:43:11] <jnuche>	 hashar: let me try my hand at the backport
[08:43:20] <wikibugs>	 (PS1) Hashar: Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564)
[08:43:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1013.eqiad.wmnet with OS bullseye
[08:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:26] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/enabled=true; selector: name=parameter_q,cluster=cache-text
[08:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:15] <hashar>	 oh there is another blocker https://phabricator.wikimedia.org/T304559  :-\
[08:44:39] <marostegui>	 !log dbmaint s7@eqiad T302658
[08:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:45] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[08:45:11] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/enabled=false; selector: name=parameter_q,cluster=cache-text
[08:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:47] <wikibugs>	 (CR) Abbe98: "Hi! Adding you as a reviewer because you have made similar patches in the past." [dumps/dcat] - https://gerrit.wikimedia.org/r/773450 (owner: Abbe98)
[08:48:29] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS buster
[08:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:37] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1078.eqiad.wmnet with OS buster
[08:55:16] <wikibugs>	 (CR) Jaime Nuche: [C: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar)
[08:55:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:57] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/enabled=true; selector: name=parameter_q,cluster=cache-text
[09:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:03] <wikibugs>	 (CR) Jaime Nuche: [V: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar)
[09:05:29] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage
[09:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:46] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:26] <wikibugs>	 (CR) Jaime Nuche: [V: +2 C: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar)
[09:08:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for imagecatalog [puppet] - https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:08:08] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:08:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:08:56] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage
[09:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471)
[09:18:04] <wikibugs>	 (PS6) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[09:18:24] <wikibugs>	 (PS1) Muehlenhoff: Extend edges alias to also include drmrs now that the site is live [puppet] - https://gerrit.wikimedia.org/r/773452
[09:18:31] <wikibugs>	 (Merged) jenkins-bot: Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar)
[09:20:38] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:20:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:20:58] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471)
[09:21:00] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish::frontend: rmeove temporary rate-limits [puppet] - https://gerrit.wikimedia.org/r/773454
[09:21:02] <wikibugs>	 (PS1) Giuseppe Lavagetto: varnish::frontend: remove normalization for parameter [puppet] - https://gerrit.wikimedia.org/r/773455
[09:26:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:07] <logmsgbot>	 !log jnuche@deploy1002 Synchronized php-1.39.0-wmf.4/includes/Linker.php: (no justification provided) (duration: 00m 50s)
[09:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:28:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:22] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1078.eqiad.wmnet with OS buster
[09:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:31] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1078.eqiad.wmnet with OS buster com...
[09:31:37] <mmandere>	 !log pool cp1078 with HAProxy as TLS termination layer - T290005
[09:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:42] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:32:55] <wikibugs>	 (PS1) Phedenskog: Add marcusolsson-json-datasource [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585)
[09:40:23] <wikibugs>	 (CR) David Caro: "Mostly questions, any nits can be ignored" [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah)
[09:46:47] <wikibugs>	 (PS7) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[09:47:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet
[09:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:06] <wikibugs>	 (PS8) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448
[09:51:12] <wikibugs>	 (CR) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (3 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah)
[09:53:28] <wikibugs>	 (PS1) Ayounsi: Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277)
[09:56:11] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet
[09:56:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1000).
[10:01:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet
[10:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:53] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1014 [puppet] - https://gerrit.wikimedia.org/r/773466 (https://phabricator.wikimedia.org/T300744)
[10:06:55] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1017 [puppet] - https://gerrit.wikimedia.org/r/773467 (https://phabricator.wikimedia.org/T302208)
[10:08:51] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:09:14] <wikibugs>	 (CR) Elukey: [C: +1] decommission kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) (owner: Alexandros Kosiaris)
[10:09:25] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1014 [puppet] - https://gerrit.wikimedia.org/r/773466 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[10:09:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1014.eqiad.wmnet with OS bullseye
[10:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:34] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet
[10:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:24] <wikibugs>	 (CR) Ayounsi: "Example diff for lsw1-e2:" [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:17:53] <wikibugs>	 (CR) Ayounsi: [C: +2] Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:18:05] <wikibugs>	 (PS34) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[10:18:31] <wikibugs>	 (Merged) jenkins-bot: Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:19:13] <wikibugs>	 ops-eqiad, DC-Ops, Traffic: cp1090.mgmt ssh port not accessible - https://phabricator.wikimedia.org/T304589 (MMandere) p:Triage→Medium
[10:19:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:20:31] <mmandere>	 !log depool cp1076 for reimage - T290005
[10:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:36] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:22:03] <wikibugs>	 (PS1) Ayounsi: Add eqiad EVPN overlay loopbacks to network::infrastructure [puppet] - https://gerrit.wikimedia.org/r/773468 (https://phabricator.wikimedia.org/T263277)
[10:23:38] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp1076 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:24:33] <wikibugs>	 (PS3) MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005)
[10:25:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: host reimage
[10:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:05] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[10:26:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet
[10:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:14] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1076.eqiad.wmnet with OS buster
[10:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:24] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1076.eqiad.wmnet with OS buster
[10:28:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: host reimage
[10:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "thanks for working on this! some comments inline." [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah)
[10:31:09] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[10:31:29] <wikibugs>	 (PS1) Ayounsi: Add static route leak for sflow collector in EVPN setup [homer/public] - https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277)
[10:32:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah)
[10:32:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah)
[10:33:31] <wikibugs>	 (PS3) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[10:33:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet
[10:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet
[10:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:52] <wikibugs>	 (CR) JMeybohm: Add helm charts and a helmfile configuration for datahub (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[10:37:12] <wikibugs>	 (PS4) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[10:39:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:40:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1014.eqiad.wmnet with OS bullseye
[10:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet
[10:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet
[10:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:46] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1017 [puppet] - https://gerrit.wikimedia.org/r/773467 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[10:43:50] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage
[10:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1017.eqiad.wmnet with OS bullseye
[10:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:44] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage
[10:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:41] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet
[10:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet
[10:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:46] <wikibugs>	 (PS5) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[10:54:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:56:15] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet
[10:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:02] <elukey>	 1014 should not be alarming, checking
[11:00:13] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:00:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage
[11:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:01] <elukey>	 weird, the calico pod on 1014 is up
[11:02:21] <elukey>	 and I don't see the alert anymore in alerts.w.o, maybe it is going to auto-solve
[11:04:09] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage
[11:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:09:28] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:10:04] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1076.eqiad.wmnet with OS buster
[11:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:13] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1076.eqiad.wmnet with OS buster com...
[11:10:28] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:11:38] <wikibugs>	 (CR) David Caro: [C: +2] "This is meant to be merged after a puppet run has gone through with just the previous patches right?" [puppet] - https://gerrit.wikimedia.org/r/773438 (owner: Majavah)
[11:12:14] <wikibugs>	 (CR) Majavah: kubeadm::helm: remove absented file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773438 (owner: Majavah)
[11:14:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34530/console" [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[11:14:26] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade.
[11:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Gotta love how deprecations happen within the same major api version." [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: RLazarus)
[11:15:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1017.eqiad.wmnet with OS bullseye
[11:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:33] <wikibugs>	 (PS1) Elukey: sre.kafka.roll-restart-brokers: generalize the restart reason [cookbooks] - https://gerrit.wikimedia.org/r/773475
[11:16:57] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me, one typo inline." [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond)
[11:18:45] <wikibugs>	 (PS19) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[11:19:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[11:20:13] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:21:39] <jbond>	 !log removing old api.svc.codfw.wmnet.pem and appservers.svc.codfw.wmnet.pem from root@puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed#
[11:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:01] <wikibugs>	 (PS6) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[11:22:19] <wikibugs>	 (CR) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[11:22:48] <jbond>	 !log puppet cert clean rendering.svc.eqiad.wmnet
[11:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:24:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[11:26:45] <mmandere>	 !log pool cp1076 with HAProxy as TLS termination layer - T290005
[11:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:50] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[11:28:26] <wikibugs>	 (PS35) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[11:28:33] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[11:38:12] <wikibugs>	 (PS1) Muehlenhoff: profile::java: Also add component/jdk on bullseye [puppet] - https://gerrit.wikimedia.org/r/773476
[11:39:04] <wikibugs>	 (PS2) Daniel Kinzler: Set MW_USE_CONFIG_SCHEMA constant if file exists. [mediawiki-config] - https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460)
[11:41:23] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[11:41:26] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff)
[11:42:35] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:44:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:45:49] <wikibugs>	 (CR) Zabe: [C: +1] Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 (owner: Majavah)
[11:46:13] <wikibugs>	 (CR) Hoo man: [C: -1] "Thanks for looking into this, this should indeed be changed." [dumps/dcat] - https://gerrit.wikimedia.org/r/773450 (owner: Abbe98)
[11:46:19] <wikibugs>	 (PS2) Majavah: kubeadm::helm: remove absented file [puppet] - https://gerrit.wikimedia.org/r/773438
[11:46:21] <wikibugs>	 (PS9) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 (https://phabricator.wikimedia.org/T303931)
[11:47:06] <wikibugs>	 (CR) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (https://phabricator.wikimedia.org/T303931) (owner: Majavah)
[11:47:25] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:47:41] <jynus>	 !log updating eqiad swift-commonswiki backups of originals T299764
[11:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:49] <stashbot>	 T299764: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764
[11:54:02] <wikibugs>	 (PS1) Majavah: P:cache::varnish::frontend: fix duplicate resource declarations [puppet] - https://gerrit.wikimedia.org/r/773477
[11:55:48] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34531/console" [puppet] - https://gerrit.wikimedia.org/r/773477 (owner: Majavah)
[11:58:34] <wikibugs>	 (CR) Majavah: "broke puppet on deployment-prep, fix is Icf78fb25cf7594ad1dc3dda72b5a09eddd018481" [puppet] - https://gerrit.wikimedia.org/r/772401 (owner: Giuseppe Lavagetto)
[11:58:52] <wikibugs>	 (CR) Jbond: "lgtm" [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[12:01:54] <wikibugs>	 (CR) JMeybohm: [C: -1] Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[12:05:11] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (JMeybohm) >>! In T304237#7790839, @Volans wrote: > ` > root@puppetmaster1001:~# for file in $(ls /var/lib/puppet/server/ssl/ca/signe...
[12:07:37] <wikibugs>	 (PS7) Jbond: P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315)
[12:07:57] <wikibugs>	 (CR) Jbond: "thans updated" [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond)
[12:08:04] <wikibugs>	 (PS8) Jbond: P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315)
[12:11:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond)
[12:11:32] <wikibugs>	 (CR) Jbond: [C: +2] P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond)
[12:16:25] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff)
[12:17:27] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:22:56] <wikibugs>	 (PS1) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490
[12:25:53] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:30:12] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (jbond) p:Triage→Medium
[12:33:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:34:37] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931)
[12:35:49] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931) (owner: 10Arturo Borrero Gonzalez)
[12:36:26] <wikibugs>	 (03Abandoned) 10Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - 10https://gerrit.wikimedia.org/r/773450 (owner: 10Abbe98)
[12:37:52] <wikibugs>	 10SRE, 10Data-Engineering: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (10jbond) 05Open→03Resolved a:03jbond This has been completed
[12:38:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10jbond) p:05Triage→03Medium
[12:38:24] <wikibugs>	 (03CR) 10Hoo man: [C: 04-1] "One nitpick, looks fine otherwise." [dumps/dcat] - 10https://gerrit.wikimedia.org/r/773490 (owner: 10Abbe98)
[12:38:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] profile::java: Also add component/jdk on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/773476 (owner: 10Muehlenhoff)
[12:38:55] <wikibugs>	 (03CR) 10Muehlenhoff: profile::java: Also add component/jdk on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773476 (owner: 10Muehlenhoff)
[12:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:43:29] <wikibugs>	 (03PS20) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[12:43:42] <wikibugs>	 (03CR) 10CDanis: "looks good enough to me just some nits" [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto)
[12:44:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931) (owner: 10Arturo Borrero Gonzalez)
[12:52:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158 for schema change', diff saved to https://phabricator.wikimedia.org/P23023 and previous config saved to /var/cache/conftool/dbconfig/20220324-125225-marostegui.json
[12:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:52] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "This is fine to go; any comment adjustment can be made later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: 10STran)
[12:54:14] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2001-dev.codfw.wmnet with OS bullseye
[12:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:50] <wikibugs>	 (03PS2) 10MSantos: maps: allow bbcrewind to access maps public urls [puppet] - 10https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968)
[12:56:01] <wikibugs>	 (03PS2) 10Tchanders: Add IPInfo to BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: 10STran)
[12:56:15] <wikibugs>	 (03CR) 10Tchanders: [C: 03+1] Add IPInfo to BetaFeatures (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: 10STran)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1300).
[13:00:05] <jouncebot>	 zabe and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:02:01] <Tchanders>	 Hi! I'm around to test if anyone is around to deploy?
[13:03:14] <zabe>	 o/
[13:04:09] * Reedy looks
[13:04:57] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Add IPInfo to BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: 10STran)
[13:05:52] <wikibugs>	 (03Merged) 10jenkins-bot: Add IPInfo to BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: 10STran)
[13:05:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Don't know if the type is overkill, so +1 with comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[13:07:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:07:14] <Reedy>	 Tchanders: It's on mwdebug1001
[13:07:23] <wikibugs>	 (03PS3) 10Reedy: Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:07:29] <Tchanders>	 Reedy: Taking a look - thanks
[13:08:11] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34533/netflow1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/773468 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[13:09:14] <wikibugs>	 (03PS2) 10Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - 10https://gerrit.wikimedia.org/r/773490
[13:09:15] <Tchanders>	 Reedy: Looks great
[13:09:22] <Reedy>	 sweet
[13:09:56] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:10:26] <logmsgbot>	 !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T292802 (duration: 00m 50s)
[13:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:31] <stashbot>	 T292802: IP Info feature should be made available as a Beta feature for launch [M] - https://phabricator.wikimedia.org/T292802
[13:10:38] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe)
[13:11:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff)
[13:11:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:44] <Reedy>	 zabe: Yours is on mwdebug1001 too... As far as we can test it ;D
[13:12:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:12:18] <Reedy>	 (I also double checked for usages of the wmf global)
[13:12:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Initial debianization of istio-cni (033 comments) [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[13:12:57] <zabe>	 Reedy, nothing seems to break and logstash looks clear, so I would say we are good to go
[13:14:02] <wikibugs>	 (03CR) 10Abbe98: "Indentation fixed." [dumps/dcat] - 10https://gerrit.wikimedia.org/r/773490 (owner: 10Abbe98)
[13:15:13] <logmsgbot>	 !log reedy@deploy1002 Synchronized tests/: T45956 (duration: 00m 49s)
[13:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:18] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[13:15:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:15:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:37] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:18:52] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage
[13:18:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:35] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage
[13:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23024 and previous config saved to /var/cache/conftool/dbconfig/20220324-132217-root.json
[13:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:22:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:45] <logmsgbot>	 !log reedy@deploy1002 Synchronized multiversion/: T45956 (duration: 00m 50s)
[13:23:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:49] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[13:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:02] <logmsgbot>	 !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T45956 (duration: 00m 49s)
[13:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:07] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[13:33:23] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:34:08] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw2001-dev.codfw.wmnet with OS bullseye
[13:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:26] <wikibugs>	 (03PS3) 10JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740)
[13:37:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23025 and previous config saved to /var/cache/conftool/dbconfig/20220324-133721-root.json
[13:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:53] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH >>! In T297913#7801355, @RobH wrote: >  > So I guess this kernel change broke it entirely?  No, you were using the wrong command :-)    "perccli" is a...
[13:42:22] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) Just to update here.  No solution as of yet, Juniper are also of the belief it is a bug in how their software processes ARPs, and the interaction be...
[13:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:43:20] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2002-dev.codfw.wmnet with OS bullseye
[13:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:58] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: remove redundant rate limits (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/773503
[13:47:24] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10MatthewVernon) Thanks for the update, and I'm glad some progress is being made :)  From my POV, I don't need this hardware just now; so happy with it staying...
[13:48:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:50:43] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:51:21] <jayme>	 uh
[13:52:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23026 and previous config saved to /var/cache/conftool/dbconfig/20220324-135225-root.json
[13:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:55] <wikibugs>	 (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[13:57:28] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage
[13:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:24] <wikibugs>	 (03PS5) 10Elukey: Initial debianization of istio-cni [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612)
[13:59:08] <wikibugs>	 (03CR) 10Elukey: Initial debianization of istio-cni (033 comments) [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[14:00:57] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage
[14:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:02:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::monitor::cloudelastic: Add checks for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:03:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:03:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, also what Riccardo said" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:04:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: C:icinga::commons: Add ssl expiry checks for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:05:44] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize deploy code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773509
[14:05:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class, so we can easily reuse it from different cookbooks. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510
[14:05:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:07:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23028 and previous config saved to /var/cache/conftool/dbconfig/20220324-140729-root.json
[14:07:32] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34535/console" [puppet] - 10https://gerrit.wikimedia.org/r/773503 (owner: 10Ssingh)
[14:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "Example diff on lsw1-f2:" [homer/public] - 10https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[14:08:24] <wikibugs>	 (03Merged) 10jenkins-bot: Add static route leak for sflow collector in EVPN setup [homer/public] - 10https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[14:09:33] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510
[14:11:13] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:11:34] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS bullseye
[14:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:13:14] <wikibugs>	 (03PS36) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[14:13:39] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: remove redundant rate limits (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/773503 (owner: 10Ssingh)
[14:14:39] <wikibugs>	 (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis)
[14:18:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:18:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite)
[14:19:21] <wikibugs>	 (03CR) 10Jforrester: "Aha, yes, the .com is the primary for that sub-domain. I don't know if that's OK for all sub-domains, but we did that for wikimedia.org, s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra)
[14:20:29] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[14:21:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'm sorry I currently don't have the bandwidth to take this on (+Matthew as he might)" [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo)
[14:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23029 and previous config saved to /var/cache/conftool/dbconfig/20220324-142233-root.json
[14:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:46] <jynus>	 backup1001: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/bacula/jobs.d/deploy2002.codfw.wmnet-/etc/helmfile-defaults/mediawiki/release-Monthly-1st-Thu-production.conf],File[/etc/bacula/jobs.d/deploy1002.eqiad.wmnet-/etc/helmfile-defaults/mediawiki/release-Monthly-1st-Tue-production.conf]
[14:23:26] <jynus>	 someone working with deploy servers?
[14:24:02] <wikibugs>	 (03PS1) 10Tchanders: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604)
[14:26:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade.
[14:26:23] <icinga-wm>	 PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite)
[14:27:13] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:28:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, Cole what do you think ?" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: 10Phedenskog)
[14:29:22] <wikibugs>	 (03CR) 10David Caro: "Why not expose it as a cookbook instead?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 (owner: 10Arturo Borrero Gonzalez)
[14:30:52] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[14:31:35] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:31:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:31:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[14:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23030 and previous config saved to /var/cache/conftool/dbconfig/20220324-143149-marostegui.json
[14:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:56] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[14:34:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:39] <moritzm>	 !log installing containerd updates on ml-serve*
[14:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:42:52] <wikibugs>	 (03PS3) 10Jbond: C:icinga::monitor::cloudelastic: Add checks for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321)
[14:43:05] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[14:43:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Not opposed in theory, though given how critical (hah!) check_http is we must make sure we get some form of testing for the script going" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:48:20] <wikibugs>	 (03PS1) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744)
[14:49:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:icinga::monitor::cloudelastic: Add checks for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:56:33] <wikibugs>	 (03PS2) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744)
[14:58:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34538/console" [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[14:59:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:59:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[14:59:58] <wikibugs>	 (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321)
[15:00:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:00:15] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:00:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:00:55] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:01:38] <wikibugs>	 (03PS3) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744)
[15:01:40] <wikibugs>	 (03PS2) 10Jbond: icinga: move client_auth_puppet_post to use wmf_check_http [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321)
[15:01:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:docker_registry_ha::registry:  Add ssl expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/773257 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:02:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:debmonitor::server:  Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773254 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:03:03] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34539/console" [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[15:03:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:chartmuseum:  Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773249 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:03:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:noc: Add ssl expiry checks for noc [puppet] - 10https://gerrit.wikimedia.org/r/773223 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:03:50] <moritzm>	 !log installing openssl1.0 security updates on stretch
[15:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:06:03] <wikibugs>	 (PS2) Jbond: C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321)
[15:06:19] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence-Backup, Patch-For-Review: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (jcrespo) @MatthewVernon Filippo said he doesn't have the bandwidth to help with the patch and recommended contacting you. Could you h...
[15:09:46] <wikibugs>	 (CR) Jbond: [C: +2] C:lvs::monitor_services: Add ssl expiry checks for lvs [puppet] - https://gerrit.wikimedia.org/r/773221 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[15:10:11] <wikibugs>	 (CR) David Caro: [C: +1] C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[15:10:26] <wikibugs>	 (CR) Herron: "LGTM overall!" [grafana-grizzly] - https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: RLazarus)
[15:10:42] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 30): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34541/console" [puppet] - https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[15:12:08] <wikibugs>	 (CR) Jbond: [C: +2] C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[15:12:43] <wikibugs>	 (CR) Jbond: [C: +2] C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[15:13:48] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:15:00] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:18:19] <wikibugs>	 (CR) Cwhite: [C: +2] profile: Rsyslog omkafka configs use new ca bundle [puppet] - https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: Cwhite)
[15:19:52] <wikibugs>	 (PS1) Jbond: P:icinga: Add ssl expiry check to external monitoring [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321)
[15:21:04] <wikibugs>	 (PS2) MMandere: site: Reimage cp2033 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005)
[15:21:06] <wikibugs>	 (PS2) MMandere: site: Reimage cp2031 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005)
[15:21:08] <wikibugs>	 (PS2) MMandere: site: Reimage cp2029 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005)
[15:21:10] <wikibugs>	 (PS2) MMandere: site: Reimage cp2027 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005)
[15:21:12] <wikibugs>	 (PS1) MMandere: site: Reimage cp2034 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005)
[15:21:14] <wikibugs>	 (PS1) MMandere: site: Reimage cp2032 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005)
[15:21:16] <wikibugs>	 (PS1) MMandere: site: Reimage cp2030 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005)
[15:21:18] <wikibugs>	 (PS1) MMandere: site: Reimage cp2028 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005)
[15:21:43] <wikibugs>	 (PS1) Cmjohnson: Adding ml-cache1001-3 to site.pp [puppet] - https://gerrit.wikimedia.org/r/773558 (https://phabricator.wikimedia.org/T299435)
[15:23:04] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:23:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (Papaul)
[15:24:31] <XioNoX>	 !log codfw: disable BGP to DE-CIX for link move
[15:24:32] <wikibugs>	 (PS1) Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - https://gerrit.wikimedia.org/r/773559
[15:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:28] <wikibugs>	 (CR) Ssingh: [C: +1] site: Reimage cp2034 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[15:25:59] <wikibugs>	 (PS2) Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648)
[15:26:01] <wikibugs>	 (CR) Ssingh: [C: +1] site: Reimage cp2032 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[15:26:32] <icinga-wm>	 RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:26:32] <wikibugs>	 (CR) Ssingh: [C: +1] site: Reimage cp2030 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[15:27:02] <wikibugs>	 (CR) Ssingh: [C: +1] site: Reimage cp2028 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[15:28:15] <wikibugs>	 (CR) Jcrespo: "I am guessing this is WIP code- so a quick comment will be the easiest to go until a more permanent solutions is available? This blocks co" [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: Jcrespo)
[15:28:47] <wikibugs>	 (PS1) Jbond: P:idp::client::https::site: Add check_http_expiry to idp services [puppet] - https://gerrit.wikimedia.org/r/773560 (https://phabricator.wikimedia.org/T304321)
[15:28:49] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding ml-cache1001-3 to site.pp [puppet] - https://gerrit.wikimedia.org/r/773558 (https://phabricator.wikimedia.org/T299435) (owner: Cmjohnson)
[15:29:29] <jynus>	 ^joe can I get a patch review?
[15:30:07] <wikibugs>	 (PS1) Ebernhardson: elastic: Remove noqa from rolling-operation.py [cookbooks] - https://gerrit.wikimedia.org/r/773561
[15:30:09] <wikibugs>	 (PS1) Ebernhardson: elastic: Bring back stopping new replicas during restart [cookbooks] - https://gerrit.wikimedia.org/r/773562
[15:31:25] <wikibugs>	 (PS1) Jbond: P:librenms::web: add check_https_expiry [puppet] - https://gerrit.wikimedia.org/r/773563 (https://phabricator.wikimedia.org/T304321)
[15:32:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:32:47] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.backy2: add link to the runbook for backup_vms [puppet] - https://gerrit.wikimedia.org/r/772839 (https://phabricator.wikimedia.org/T304408) (owner: David Caro)
[15:32:49] <wikibugs>	 (PS1) Muehlenhoff: Enable Ganeti 3 for ganeti-test* [puppet] - https://gerrit.wikimedia.org/r/773564
[15:33:02] <wikibugs>	 (PS1) Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612)
[15:33:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:34:23] <wikibugs>	 (PS1) Jbond: P:lists::monitoring: Add check_https_expiry check [puppet] - https://gerrit.wikimedia.org/r/773566 (https://phabricator.wikimedia.org/T304321)
[15:34:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:35:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:35:31] <wikibugs>	 (CR) Jcrespo: "Hi," [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[15:37:32] <wikibugs>	 (PS1) Jbond: P:microsites::peopleweb: add check_http_expiry monitor [puppet] - https://gerrit.wikimedia.org/r/773567 (https://phabricator.wikimedia.org/T304321)
[15:38:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:39:25] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, UploadWizard, Tracking-Neverending: Uploadstash errors (tracking) - https://phabricator.wikimedia.org/T85568 (Krinkle)
[15:39:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS bullseye
[15:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet wit...
[15:39:59] <wikibugs>	 (PS4) Elukey: role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130)
[15:40:00] <wikibugs>	 (PS1) Jbond: P:netbox: add check_https_expiry [puppet] - https://gerrit.wikimedia.org/r/773568 (https://phabricator.wikimedia.org/T304321)
[15:40:08] <wikibugs>	 (CR) Elukey: role::kafka::logging: add PKI migration settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[15:41:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1142.mgmt.eqiad.wmnet with reboot policy FORCED
[15:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:44:28] <wikibugs>	 (PS2) Arlolra: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555)
[15:45:35] <wikibugs>	 (PS1) Jbond: P:phabricator: add check_expiry for phabricator [puppet] - https://gerrit.wikimedia.org/r/773571 (https://phabricator.wikimedia.org/T304321)
[15:49:16] <wikibugs>	 (PS1) Jbond: P:icinga::debmonitor: correct check definition [puppet] - https://gerrit.wikimedia.org/r/773573 (https://phabricator.wikimedia.org/T304321)
[15:49:35] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] P:icinga::debmonitor: correct check definition [puppet] - https://gerrit.wikimedia.org/r/773573 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[15:51:30] <wikibugs>	 (PS3) Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648)
[15:51:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage
[15:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:51] <wikibugs>	 SRE, Data-Engineering, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (elukey) Adding some context for the Traffic team. There were two varnishkafka versions, one in the `main` component and one in `component/varnish6` of `buster-wikimedia` at the time...
[15:56:53] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage
[15:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:28] <wikibugs>	 SRE, Data-Engineering, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (BBlack) Thanks for making this ticket and adding those insights!  I agree, there have been multiple times in the past that we've had problems in this area, and we should probably pup...
[16:00:04] <jouncebot>	 jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:02:17] <wikibugs>	 (CR) JMeybohm: bacula: Unbreak director: disable deployment backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: Jcrespo)
[16:03:57] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:07:15] <wikibugs>	 (CR) Razzi: "Am I understanding Type=notify correctly? See commit message" [puppet] - https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) (owner: Razzi)
[16:07:16] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1001.eqiad.wmnet with OS bullseye
[16:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:19] <wikibugs>	 (CR) Jcrespo: bacula: Unbreak director: disable deployment backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: Jcrespo)
[16:07:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS...
[16:07:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1142.mgmt.eqiad.wmnet with reboot policy FORCED
[16:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye
[16:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet wit...
[16:09:20] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[16:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:04] <brennen>	 jouncebot nowandnext
[16:12:04] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600)
[16:12:04] <jouncebot>	 In 1 hour(s) and 47 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800)
[16:12:37] <wikibugs>	 (CR) BBlack: [C: -1] "Looking pretty good overall, a couple of comments inline here (maybe remove the TODO part entirely too, if you agree).  We should definite" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx)
[16:12:53] <brennen>	 current window appears clear, train's unblocked, we're going ahead to all wikis with wmf.4
[16:13:27] <brennen>	 !log trainsperiment (T300203): blockers clear, logs triaged, rolling 1.39.0-wmf.4 out to all wikis again
[16:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:33] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[16:15:48] <wikibugs>	 (PS3) RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842)
[16:16:27] <wikibugs>	 (PS1) Brennen Bearnes: group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773579
[16:16:28] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773579 (owner: Brennen Bearnes)
[16:16:34] <wikibugs>	 (CR) RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden (2 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: RLazarus)
[16:17:12] <wikibugs>	 (PS1) Jbond: P:openstack: use correct vhost when checking sl expiry [puppet] - https://gerrit.wikimedia.org/r/773580
[16:17:42] <wikibugs>	 (CR) Jbond: [C: +2] P:openstack: use correct vhost when checking sl expiry [puppet] - https://gerrit.wikimedia.org/r/773580 (owner: Jbond)
[16:18:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1003.eqiad.wmnet with OS bullseye
[16:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye
[16:18:50] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773579 (owner: Brennen Bearnes)
[16:19:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[16:19:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[16:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:28] <wikibugs>	 SRE, Observability-Metrics, observability, Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (fgiunchedi) Open→Declined I believe with {T240685} in mediawiki (i.e. Prometheus / generic tags support) this can be declined (though feel fr...
[16:19:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED
[16:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:33] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4  refs T300203
[16:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:39] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[16:20:51] <wikibugs>	 (PS4) Krinkle: Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:21:16] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[16:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: backup: fix filesets definition for mw on k8s [puppet] - https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648)
[16:21:32] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/771560 (owner: Muehlenhoff)
[16:21:36] <wikibugs>	 (PS1) Brennen Bearnes: group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773582
[16:21:38] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773582 (owner: Brennen Bearnes)
[16:21:40] <wikibugs>	 (CR) Krinkle: [C: +1] "TestServices.php and TestServices.php still set wmfMasterServices, that one and other can be removed from there as well." [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[16:22:43] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773582 (owner: Brennen Bearnes)
[16:23:33] <wikibugs>	 SRE, Observability-Metrics: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (fgiunchedi) Open→Declined Boldly declining this since graphite is in life support mode and the lowest hanging fruits have been addressed (thanks!)
[16:23:55] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:24:17] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.4  refs T300203
[16:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:18] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[16:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:25] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.4  refs T300203 (duration: 01m 06s)
[16:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:38] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline for a nit" [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:25:45] <wikibugs>	 (CR) Jcrespo: "Looks fine, I only have one question- general_dir is not currently defined? Could it in the future be different between, eg. eqiad and cod" [puppet] - https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[16:26:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[16:26:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:26:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:28] <wikibugs>	 (PS2) Jbond: P:icinga: Add ssl expiry check to external monitoring [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321)
[16:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:46] <wikibugs>	 (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:26:50] <wikibugs>	 (Abandoned) Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: Jcrespo)
[16:27:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:29] <wikibugs>	 (PS1) Brennen Bearnes: all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773583
[16:27:31] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773583 (owner: Brennen Bearnes)
[16:28:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:icinga: Add ssl expiry check to external monitoring [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:28:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] backup: fix filesets definition for mw on k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[16:28:21] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.39.0-wmf.4  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/773583 (owner: Brennen Bearnes)
[16:29:53] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.4  refs T300203
[16:29:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:59] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[16:30:25] <wikibugs>	 (CR) Jcrespo: [C: +1] backup: fix filesets definition for mw on k8s [puppet] - https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto)
[16:30:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage
[16:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1145.mgmt.eqiad.wmnet with reboot policy FORCED
[16:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:33:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage
[16:34:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:20] <wikibugs>	 (CR) Jbond: [C: +2] P:icinga: Add ssl expiry check to external monitoring [puppet] - https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:34:22] <wikibugs>	 (CR) Elukey: [C: +2] role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[16:34:49] <elukey>	 jbond: ok to merge?
[16:35:03] <wikibugs>	 (CR) BBlack: [C: +2] map Spain to drmrs [dns] - https://gerrit.wikimedia.org/r/773244 (https://phabricator.wikimedia.org/T304089) (owner: BBlack)
[16:35:06] <jbond>	 please do elukey 
[16:35:23] <elukey>	 done!
[16:35:26] <jbond>	 thx
[16:35:34] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: keepalived: use version from bullseye-bpo [puppet] - https://gerrit.wikimedia.org/r/773585 (https://phabricator.wikimedia.org/T304598)
[16:35:36] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: don't install kernel or nft from backports [puppet] - https://gerrit.wikimedia.org/r/773586 (https://phabricator.wikimedia.org/T304598)
[16:35:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS bullseye
[16:35:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: -...
[16:36:26] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED
[16:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:40] <wikibugs>	 SRE, Observability-Logging, Privacy Engineering, Wikimedia-Logstash, and 2 others: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (fgiunchedi) Open→Resolved a:fgiunchedi Resolving in favor of {T246998} since that'll...
[16:38:11] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:38:22] <wikibugs>	 SRE, Observability-Logging: Monitor the BMC's event log for hardware errors - https://phabricator.wikimedia.org/T136311 (fgiunchedi) Mentioning {T302639} here too since the two are related
[16:38:46] <wikibugs>	 (PS1) Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758)
[16:39:15] <wikibugs>	 (CR) Jbond: [C: +2] P:phabricator: add check_expiry for phabricator [puppet] - https://gerrit.wikimedia.org/r/773571 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:39:35] <wikibugs>	 (CR) Jbond: [C: +2] P:netbox: add check_https_expiry [puppet] - https://gerrit.wikimedia.org/r/773568 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:39:59] <wikibugs>	 (CR) Jbond: [C: +2] P:microsites::peopleweb: add check_http_expiry monitor [puppet] - https://gerrit.wikimedia.org/r/773567 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:40:18] <wikibugs>	 (CR) Jbond: [C: +2] P:lists::monitoring: Add check_https_expiry check [puppet] - https://gerrit.wikimedia.org/r/773566 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:40:37] <wikibugs>	 (CR) Jbond: [C: +2] P:librenms::web: add check_https_expiry [puppet] - https://gerrit.wikimedia.org/r/773563 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[16:41:00] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/773586 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[16:42:02] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1146.mgmt.eqiad.wmnet with reboot policy FORCED
[16:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:08] <wikibugs>	 SRE, Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (MoritzMuehlenhoff) Since we've replaced Kibana with Opensearch Dashboards we now actually can use OIDC or SAML it seems: https://opensearch.org/docs/latest/security-plugin/configuration/openid-connect/ https://...
[16:44:19] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1003.eqiad.wmnet with OS bullseye
[16:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: -...
[16:46:32] <wikibugs>	 SRE, Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) @elukey This issue hasn't reappeared since we began dropping the field. If you're ok with keeping this mitigation in place, please feel free to c...
[16:47:26] <wikibugs>	 (CR) JMeybohm: Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:47:56] <wikibugs>	 (CR) JMeybohm: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:49:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:49:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:23] <wikibugs>	 (PS1) Urbanecm: logos: add commons filename for fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773590
[16:49:25] <wikibugs>	 (PS1) Urbanecm: fawiki: Set new year celebration [mediawiki-config] - https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314)
[16:49:28] <wikibugs>	 (PS37) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[16:51:02] <urbanecm>	 jouncebot: nowandnext
[16:51:02] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600)
[16:51:02] <jouncebot>	 In 1 hour(s) and 8 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800)
[16:51:43] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:52:29] <urbanecm>	 jbond: rzl: is anything happening in puppet window?
[16:52:33] <urbanecm>	 or can i do a quick mw deploy?
[16:55:26] <wikibugs>	 (PS1) Ladsgroup: Enable WRITE BOTH for templatelinks normalization in more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773594 (https://phabricator.wikimedia.org/T299421)
[16:55:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:55:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:48] <rzl>	 urbanecm: nope, all yours
[16:55:52] <urbanecm>	 thanks
[16:56:19] <wikibugs>	 (CR) Urbanecm: [C: +2] logos: add commons filename for fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773590 (owner: Urbanecm)
[16:56:46] <wikibugs>	 (CR) Jbond: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[16:57:18] <wikibugs>	 (Merged) jenkins-bot: logos: add commons filename for fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773590 (owner: Urbanecm)
[16:57:49] <wikibugs>	 (CR) RLazarus: [C: +2] envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: RLazarus)
[16:57:57] <wikibugs>	 (PS1) MSantos: mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - https://gerrit.wikimedia.org/r/773595
[16:58:21] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1146.mgmt.eqiad.wmnet with reboot policy FORCED
[16:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:25] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1145.mgmt.eqiad.wmnet with reboot policy FORCED
[16:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:51] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
[16:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:57] <wikibugs>	 (CR) Urbanecm: [C: +2] fawiki: Set new year celebration [mediawiki-config] - https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314) (owner: Urbanecm)
[17:00:10] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
[17:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:00:37] <wikibugs>	 SRE, Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) Open→Resolved I am yes! Thanks a lot for the support!
[17:00:58] <wikibugs>	 (Merged) jenkins-bot: fawiki: Set new year celebration [mediawiki-config] - https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314) (owner: Urbanecm)
[17:01:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:25] <wikibugs>	 (CR) Jbond: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[17:02:32] <wikibugs>	 (CR) MVernon: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[17:03:01] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
[17:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:22] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
[17:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:56] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:04:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 05d55a9: fawiki: Set new year celebration (T304314; 1/3) (duration: 00m 50s)
[17:04:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson)
[17:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:08] <stashbot>	 T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314
[17:04:40] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (KFrancis) @jbond The agreement has been signed.  Please proceed with the access request.  Thanks!
[17:05:20] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) In progress→Resolved
[17:05:26] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[17:06:04] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 05d55a9: fawiki: Set new year celebration (T304314; 2/3) (duration: 00m 49s)
[17:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:37] <wikibugs>	 (CR) Bking: [C: +2] elastic: Bring back stopping new replicas during restart [cookbooks] - https://gerrit.wikimedia.org/r/773562 (owner: Ebernhardson)
[17:06:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:11] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized logos/config.yaml: 05d55a9: fawiki: Set new year celebration (T304314; 3/3) (duration: 00m 49s)
[17:07:11] <wikibugs>	 (CR) Bking: [C: +2] elastic: Remove noqa from rolling-operation.py [cookbooks] - https://gerrit.wikimedia.org/r/773561 (owner: Ebernhardson)
[17:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:16] * urbanecm done
[17:09:24] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.598e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12
[17:10:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:10:12] <wikibugs>	 (Merged) jenkins-bot: elastic: Remove noqa from rolling-operation.py [cookbooks] - https://gerrit.wikimedia.org/r/773561 (owner: Ebernhardson)
[17:10:12] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:21] <wikibugs>	 (Merged) jenkins-bot: elastic: Bring back stopping new replicas during restart [cookbooks] - https://gerrit.wikimedia.org/r/773562 (owner: Ebernhardson)
[17:10:38] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:10:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[17:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:30] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.599e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[17:11:58] <wikibugs>	 SRE, Analytics, Data-Engineering, Event-Platform, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (BTullis)
[17:12:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED
[17:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:05] <wikibugs>	 (PS7) Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238)
[17:15:30] <wikibugs>	 (CR) Phuedx: Request high-entropy Sec-CH-UA* client hints (2 comments) [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx)
[17:15:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) 1143, 1147 and 1148 did not respond to the provision script
[17:15:46] <wikibugs>	 (CR) Bking: [C: +2] [wdqs] test jvmquake options on the public cluster [puppet] - https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: DCausse)
[17:18:19] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:23:23] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.1699 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[17:23:26] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (EChetty)
[17:23:32] <wikibugs>	 (CR) MSantos: [C: +2] mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - https://gerrit.wikimedia.org/r/773595 (owner: MSantos)
[17:24:32] <elukey>	 mmm there seems to be a big set of failures for the Exec verify-envoy-config
[17:24:59] <elukey>	 rzl: --^
[17:26:16] <wikibugs>	 (PS5) Zabe: Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956)
[17:27:21] <elukey>	 Proto constraint validation failed (field: "upstream_protocol_options", reason: is required)
[17:27:37] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:28:38] <elukey>	 I guess it is https://gerrit.wikimedia.org/r/c/operations/puppet/+/773364 ?
[17:29:09] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - https://gerrit.wikimedia.org/r/773595 (owner: MSantos)
[17:29:56] <wikibugs>	 (CR) Ottomata: [C: +1] "I'm not 100% sure, but this looks correct to me, you don't need Notify" [puppet] - https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) (owner: Razzi)
[17:30:03] <wikibugs>	 (CR) Jbond: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[17:30:11] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:30:56] <wikibugs>	 (PS1) Andrew Bogott: toolsbeta: update nfs server location [puppet] - https://gerrit.wikimedia.org/r/773601
[17:31:55] <wikibugs>	 (PS1) Zabe: Use $wmgUseRestbaseVRS in comment [mediawiki-config] - https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956)
[17:31:58] <wikibugs>	 (PS2) Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758)
[17:32:19] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:52] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:34:14] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[17:34:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[17:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[17:34:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[17:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23035 and previous config saved to /var/cache/conftool/dbconfig/20220324-173450-marostegui.json
[17:34:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:02] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[17:35:02] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:04] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[17:36:31] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[17:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:37] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:36:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23036 and previous config saved to /var/cache/conftool/dbconfig/20220324-173638-root.json
[17:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:57] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:53] <wikibugs>	 (CR) Andrew Bogott: [C: +2] toolsbeta: update nfs server location [puppet] - https://gerrit.wikimedia.org/r/773601 (owner: Andrew Bogott)
[17:39:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:41:04] <wikibugs>	 (CR) Ladsgroup: "Hear me out." [software] - https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605) (owner: Marostegui)
[17:41:10] <wikibugs>	 (CR) Jbond: [C: +2] P:idp::client::https::site: Add check_http_expiry to idp services [puppet] - https://gerrit.wikimedia.org/r/773560 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[17:42:34] <wikibugs>	 (PS4) Ladsgroup: orchestrator: Use macros in apache config. [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[17:42:41] <wikibugs>	 (PS5) Ladsgroup: orchestrator: Use macros in apache config [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[17:42:41] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:43:00] <wikibugs>	 (PS6) Ladsgroup: orchestrator: Use macros in apache config [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[17:43:06] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] orchestrator: Use macros in apache config [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[17:44:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.restart
[17:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:46] <wikibugs>	 SRE, Data-Engineering-Radar, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (EChetty)
[17:50:43] <wikibugs>	 (CR) Herron: [C: +1] "LGTM 👍" [grafana-grizzly] - https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: RLazarus)
[17:51:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23037 and previous config saved to /var/cache/conftool/dbconfig/20220324-175142-root.json
[17:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[17:55:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson) Open→Resolved Completed
[17:57:09] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster
[17:57:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster
[17:58:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster
[17:58:40] <wikibugs>	 (PS1) Zabe: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956)
[17:58:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:42] <wikibugs>	 (PS1) Zabe: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956)
[17:58:46] <wikibugs>	 (PS1) Zabe: Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956)
[17:58:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster
[17:58:58] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[17:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:59:22] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster
[17:59:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster
[17:59:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[17:59:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster
[18:00:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: That opportune time is upon us again. Time for a 🚂🧪Trainsperiment Week Deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800).
[18:00:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[18:01:13] <hashar>	 dancy: dduvall: jeena: brennen: so we do another extra train today? :-]
[18:01:30] <brennen>	 nope, we're done. :)
[18:01:33] <dduvall>	 no sir
[18:01:52] <wikibugs>	 (CR) Jbond: "in case you missed it this changed cause verify-envoy-config to fail, there is some additional chat in #w-serviceops" [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: RLazarus)
[18:02:23] <dduvall>	 4 trains in 4 days 🌅🤠🐎
[18:03:18] <wikibugs>	 (CR) RLazarus: envoy: Move upstream HTTP config into the new HttpProtocolOptions message (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: RLazarus)
[18:03:37] <wikibugs>	 (PS1) RLazarus: Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - https://gerrit.wikimedia.org/r/773532
[18:05:26] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[18:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:54] <wikibugs>	 (PS2) Zabe: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956)
[18:06:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23038 and previous config saved to /var/cache/conftool/dbconfig/20220324-180646-root.json
[18:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster
[18:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:00] <wikibugs>	 (PS2) Zabe: Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956)
[18:07:24] <wikibugs>	 (CR) RLazarus: [C: +2] Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - https://gerrit.wikimedia.org/r/773532 (owner: RLazarus)
[18:07:45] <wikibugs>	 (PS1) Jbond: P:idp::client::httpd: fix expiry check [puppet] - https://gerrit.wikimedia.org/r/773611
[18:07:50] <wikibugs>	 (PS2) RLazarus: Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - https://gerrit.wikimedia.org/r/773532 (https://phabricator.wikimedia.org/T303230)
[18:08:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS buster
[18:08:03] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] P:idp::client::httpd: fix expiry check [puppet] - https://gerrit.wikimedia.org/r/773611 (owner: Jbond)
[18:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[18:08:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster
[18:08:30] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Resolved→In progress
[18:09:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS buster
[18:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:01] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Use $wmgUseRestbaseVRS in comment [mediawiki-config] - https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[18:10:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster
[18:12:51] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773612
[18:14:47] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001593 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:15:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson)
[18:17:15] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[18:17:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773510 (owner: Arturo Borrero Gonzalez)
[18:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:35] <wikibugs>	 (PS1) Cmjohnson: Updating site.pp for an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922)
[18:17:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:18:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Updating site.pp for an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) (owner: Cmjohnson)
[18:18:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:20:02] <wikibugs>	 (PS2) Cmjohnson: Updating site.pp for an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922)
[18:20:04] <wikibugs>	 (PS3) Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465)
[18:20:08] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:21:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23039 and previous config saved to /var/cache/conftool/dbconfig/20220324-182150-root.json
[18:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:23] <wikibugs>	 (PS1) Razzi: wikireplicas: remove wb_changes_dispatch view for dropped table [puppet] - https://gerrit.wikimedia.org/r/773614 (https://phabricator.wikimedia.org/T304591)
[18:24:01] <wikibugs>	 (CR) Cmjohnson: [C: +2] Updating site.pp for an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) (owner: Cmjohnson)
[18:24:03] <wikibugs>	 (Abandoned) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:24:04] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:24:12] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:24:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/773585 (https://phabricator.wikimedia.org/T304598) (owner: Arturo Borrero Gonzalez)
[18:26:30] <wikibugs>	 (Restored) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:26:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster
[18:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmn...
[18:26:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster
[18:26:53] <wikibugs>	 (PS1) Zabe: filtered_tables.txt: remove gu_enabled and gu_enabled_method columns [puppet] - https://gerrit.wikimedia.org/r/773616 (https://phabricator.wikimedia.org/T303266)
[18:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmn...
[18:28:08] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) Raid testing.  I can poll the controller for basic info: root@dumpsdata1007:~# perccli64 /c0/dall show and get BBU info: perccli64 /c0/bbu show all perccli64 /c0/d0 show  I don't get how to poll fo...
[18:28:11] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1145.eqiad.wmnet with OS buster
[18:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmn...
[18:28:46] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1144.eqiad.wmnet with OS buster
[18:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmn...
[18:31:19] <wikibugs>	 (PS4) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930
[18:33:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:34:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:35:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1146.eqiad.wmnet with OS buster
[18:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmn...
[18:35:54] <wikibugs>	 (PS5) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930
[18:36:02] <wikibugs>	 (CR) Jbond: R:tlsproxy: Drop version 3 support and add missing docs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:36:28] <razzi>	 !log razzi@deneb:~$ sudo docker system prune (reclaimed 33GB)
[18:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:52] <wikibugs>	 (PS38) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[18:36:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23040 and previous config saved to /var/cache/conftool/dbconfig/20220324-183654-root.json
[18:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:37:33] <wikibugs>	 (PS6) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930
[18:37:37] <wikibugs>	 (CR) Ladsgroup: [C: +1] filtered_tables.txt: remove gu_enabled and gu_enabled_method columns [puppet] - https://gerrit.wikimedia.org/r/773616 (https://phabricator.wikimedia.org/T303266) (owner: Zabe)
[18:38:12] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[18:38:30] <wikibugs>	 (PS7) Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - https://gerrit.wikimedia.org/r/771930
[18:44:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster
[18:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad...
[18:46:08] <wikibugs>	 (PS8) Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - https://gerrit.wikimedia.org/r/771930
[18:47:18] <wikibugs>	 (PS9) Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - https://gerrit.wikimedia.org/r/771930
[18:47:40] <wikibugs>	 SRE: Automatically prune docker to clear disk space on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T304644 (razzi)
[18:48:48] <wikibugs>	 (PS1) Razzi: package_builder: run docker prune on a timer [puppet] - https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644)
[18:49:45] <wikibugs>	 (PS1) Andrew Bogott: toolsbeta: update nfs server location [puppet] - https://gerrit.wikimedia.org/r/773623
[18:49:54] <icinga-wm>	 RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[18:50:20] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34543/console" [puppet] - https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[18:50:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:44] <wikibugs>	 (CR) Razzi: [V: +1] "As discussed in #-sre" [puppet] - https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[18:52:51] <wikibugs>	 (CR) Andrew Bogott: [C: +2] toolsbeta: update nfs server location [puppet] - https://gerrit.wikimedia.org/r/773623 (owner: Andrew Bogott)
[18:53:46] <wikibugs>	 (CR) Dzahn: [C: +1] "given this is like existing modules/profile/manifests/ci/docker.pp it looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[18:56:55] <wikibugs>	 (CR) CDanis: [C: +2] maps: allow bbcrewind to access maps public urls [puppet] - https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968) (owner: MSantos)
[18:57:01] <wikibugs>	 (CR) RLazarus: [C: +1] "LGTM as long as PCC is still happy, thanks!" [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[18:59:11] <wikibugs>	 (CR) Razzi: [V: +1 C: +2] package_builder: run docker prune on a timer [puppet] - https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[19:00:02] <wikibugs>	 (CR) Jbond: [C: +2] R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[19:00:32] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:01:54] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1142.eqiad.wmnet with OS buster
[19:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmn...
[19:02:49] <wikibugs>	 SRE, Patch-For-Review: Automatically prune docker to clear disk space on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T304644 (razzi) Open→Resolved a:razzi Timers are present!  ` razzi@deneb:~$ systemctl list-timers | grep docker ... Fri 2022-03-25 03:58:40 UTC  8h left             n...
[19:06:14] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (NOOP 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34544/console" [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond)
[19:07:10] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:13:52] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:20:18] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:34] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1147.eqiad.wmnet with OS buster
[19:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster exec...
[19:21:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1148.eqiad.wmnet with OS buster
[19:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster exec...
[19:22:30] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23041 and previous config saved to /var/cache/conftool/dbconfig/20220324-192741-marostegui.json
[19:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:48] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[19:33:36] <wikibugs>	 (CR) Dzahn: "I like that we avoid spamming the channel, I agree as well that "could not load file" should be a warning. The only concern I have that th" [puppet] - https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) (owner: Filippo Giunchedi)
[19:35:27] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[19:35:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b...
[19:42:28] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:42:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23042 and previous config saved to /var/cache/conftool/dbconfig/20220324-194246-marostegui.json
[19:42:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:44:46] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:45:21] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott I think this is just a wdqs person cleaning up https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:49:01] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:53:06] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:56:33] <wikibugs>	 (CR) Dzahn: [C: +1] hieradata: remove unused deployment-prep scap targets [puppet] - https://gerrit.wikimedia.org/r/770880 (owner: Majavah)
[19:57:18] <dancy>	 Welcome back dzahn!
[19:57:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23043 and previous config saved to /var/cache/conftool/dbconfig/20220324-195752-marostegui.json
[19:57:53] <mutante>	 thanks dancy 
[19:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:08] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (Andrew) Open→Resolved
[20:00:05] <jouncebot>	 brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T2000). Please do the needful.
[20:00:05] <jouncebot>	 jan_drewniak, Lucas_WMDE, and zabe: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <zabe>	 o/
[20:01:29] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailman3: 550-Support for list subscription via email has been disabled. - https://phabricator.wikimedia.org/T303888 (Ladsgroup) Yup, this is something we carried over from mailman2 given the history of abuse with mass subscription via email. Where is it being advertised?
[20:01:50] <thcipriani>	 howdy zabe 
[20:01:53] <RhinosF1>	 Hey mutante
[20:01:56] <brennen>	 wb mutante!
[20:02:44] <wikibugs>	 (CR) Thcipriani: [C: +2] Use $wmgUseRestbaseVRS in comment [mediawiki-config] - https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:03:33] <wikibugs>	 (Merged) jenkins-bot: Use $wmgUseRestbaseVRS in comment [mediawiki-config] - https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:03:38] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[20:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls...
[20:03:45] <mutante>	 brennen: RhinosF1: :) *wave*
[20:03:53] <brennen>	 jan_drewniak: around?
[20:04:06] <RhinosF1>	 How was your holiday mutante
[20:04:31] <thcipriani>	 tab completion is failing me for Lucas_WMDE
[20:04:49] <RhinosF1>	 He's not here
[20:05:05] <mutante>	 @seen Lucas_WMDE
[20:05:10] <RhinosF1>	 Left 45 min ago thcipriani
[20:05:24] <RhinosF1>	 19:13:47 ⇐︎ Lucas_WMDE quit (~Lucas_WMD@user/lucas-wmde/x-3192532): Quit: Lucas_WMDE
[20:05:29] <mutante>	 RhinosF1: pretty good, thank you
[20:05:45] <mutante>	 !seen Lucas_WMDE
[20:05:56] <mutante>	 I keep forgetting the right prefix but it used to work
[20:06:46] <RhinosF1>	 @ping
[20:06:54] <RhinosF1>	 wm-bot: hi
[20:07:05] <wm-bot>	 http://wm-bot.wmcloud.org/dump/%23wikimedia-operations.htm
[20:07:05] <RhinosF1>	 @info
[20:07:22] <RhinosF1>	 @seen mutante
[20:07:27] <RhinosF1>	 Hmm
[20:07:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:28] <wikibugs>	 SRE, Data-Engineering-Radar, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (odimitrijevic)
[20:07:28] <RhinosF1>	 Weird
[20:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:36] <mutante>	 yea, we had this feature in the past
[20:07:51] <mutante>	 not 100% sure if it was from wm-bot
[20:08:16] <RhinosF1>	 @helo
[20:08:18] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773602|Use $wmgUseRestbaseVRS in comment (T45956)]] (duration: 01m 05s)
[20:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:22] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:08:23] <wm-bot>	 I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.1.0 [libirc v. 1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features
[20:08:23] <RhinosF1>	 @help
[20:08:41] <RhinosF1>	 mutante: it's pm only
[20:08:52] <mutante>	 RhinosF1: aha! thanks
[20:09:18] <jan_drewniak>	 brennen: hi, sorry I'm late, I can do my deploy at the end
[20:09:52] <brennen>	 thanks jan_drewniak.  ^ cc: thcipriani 
[20:11:04] <thcipriani>	 zabe: for 768255 > Some of these do have non-variable usage, such as in the hook for siteinfo API, as used in Puppet code. — does that mean these are *still used* in puppet code?
[20:11:58] <wikibugs>	 (PS3) Thcipriani: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: Jdrewniak)
[20:12:12] <wikibugs>	 (CR) Thcipriani: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: Jdrewniak)
[20:12:40] <zabe>	 thcipriani: the fields for the siteinfo API are set here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/CommonSettings.php#552
[20:12:45] <zabe>	 I don't touch that part
[20:12:57] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: Jdrewniak)
[20:12:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23044 and previous config saved to /var/cache/conftool/dbconfig/20220324-201257-marostegui.json
[20:12:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[20:13:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[20:13:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:02] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[20:13:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23045 and previous config saved to /var/cache/conftool/dbconfig/20220324-201305-marostegui.json
[20:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:29] <thcipriani>	 zabe: ah, ok, misread the message, thanks :)
[20:15:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:15:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:45] <thcipriani>	 jan_drewniak: I misread your message, too -- your change is staged on mwdebug1002, possible to check there? (I forget how portals works)
[20:16:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:44] <wikibugs>	 (CR) Thcipriani: [C: +2] Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:17:58] <wikibugs>	 (CR) Thcipriani: Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:18:00] <jan_drewniak>	 thcipriani: thanks! I'll check it now 
[20:18:06] <wikibugs>	 (PS6) Thcipriani: Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:18:17] <wikibugs>	 (CR) Thcipriani: [C: +2] Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:18:59] <jan_drewniak>	 thcipriani: looks good to sync
[20:19:03] <wikibugs>	 (Merged) jenkins-bot: Stop writing to certain $wmf* global variables [mediawiki-config] - https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:19:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[20:19:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[20:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:27] <thcipriani>	 jan_drewniak: cool -- sync-portals is still the right magic for this?
[20:20:40] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:21:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:23] <jan_drewniak>	 thcipriani: yup :) 
[20:22:24] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized portals/wikipedia.org/assets: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s)
[20:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:29] <stashbot>	 T282012: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012
[20:22:41] <wikibugs>	 SRE, Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (colewhite) >>! In T246998#7804127, @MoritzMuehlenhoff wrote: > Since we've replaced Kibana with Opensearch Dashboards we now actually can use OIDC or SAML it seems:  Indeed!  We have asked Legal to clarify if t...
[20:23:16] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized portals: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s)
[20:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:24] <thcipriani>	 ^ jan_drewniak all done!
[20:23:38] <jan_drewniak>	 thcipriani: thanks! 
[20:24:24] <thcipriani>	 jan_drewniak: yw :)
[20:24:52] <thcipriani>	 zabe: your second patch is live on mwdebug (in case there's anything specific you wanted to test aside from making sure nothing explodes :))
[20:25:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:25:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:26:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:40] <zabe>	 thcipriani, nothing seems to explode and logstash is clear, so I would say we are good to go
[20:26:55] <thcipriani>	 perfect, thanks
[20:28:30] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized tests: Config: [[gerrit:768255|Stop writing to certain $wmf* global variables (T45956)]] (part I) (duration: 00m 50s)
[20:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:36] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:29:58] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized docroot/noc/db.php: Config: [[gerrit:768255|Stop writing to certain $wmf* global variables (T45956)]] (part II) (duration: 00m 51s)
[20:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:10] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:31:30] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:768255|Stop writing to certain $wmf* global variables (T45956)]] (part 3) (duration: 00m 55s)
[20:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:12] <thcipriani>	 ^ zabe that's patch #2
[20:32:24] <zabe>	 thx
[20:32:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:09] <wikibugs>	 (CR) Thcipriani: [C: +2] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:33:16] <wikibugs>	 (CR) Thcipriani: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:33:35] <thcipriani>	 fun
[20:33:42] <thcipriani>	 zabe: could you rebase your last one for me
[20:33:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:34:09] <zabe>	 ah, sure
[20:34:41] <wikibugs>	 (CR) AGueyte: [C: +1] Set IPInfo config for path to MaxMind files [mediawiki-config] - https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) (owner: Tchanders)
[20:35:41] <wikibugs>	 (PS2) Zabe: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956)
[20:35:52] <urbanecm>	 thcipriani: hello, can you please ping me when you're done? I'd like to do a workaround for T304529 please.
[20:35:52] <stashbot>	 T304529: scap update-interwiki-cache throws MWException: Setting $wgInterwikiCache to a CDB path is no longer supported - https://phabricator.wikimedia.org/T304529
[20:36:23] <thcipriani>	 urbanecm: yep, will do
[20:36:31] <urbanecm>	 appreciated
[20:36:36] <zabe>	 thcipriani, rebased
[20:37:02] <wikibugs>	 (CR) Thcipriani: [C: +2] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:37:49] <thcipriani>	 thanks zabe 
[20:38:02] <wikibugs>	 (Merged) jenkins-bot: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:40:38] <thcipriani>	 zabe: live on mwdebug1002 for any checks you'd like to do
[20:41:44] <zabe>	 thcipriani, lgtm
[20:41:55] <thcipriani>	 thanks for checking :)
[20:42:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:46] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773607|Start writing to $wmgAllServices the same value as to $wmfAllServices (T45956)]] (duration: 01m 17s)
[20:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:50] <stashbot>	 T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[20:43:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:43:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:43:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:10] <thcipriani>	 ^ zabe should be live, nice low bug number for that one :)
[20:44:28] <zabe>	 thanks for your help :)
[20:44:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:44:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:44] <thcipriani>	 thanks for the patch :)
[20:46:29] <wikibugs>	 (CR) BBlack: [C: +2] map France to drmrs [dns] - https://gerrit.wikimedia.org/r/773245 (https://phabricator.wikimedia.org/T304089) (owner: BBlack)
[20:54:53] <wikibugs>	 (PS1) Urbanecm: fawiki: Set celebration logo for new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314)
[20:54:58] <urbanecm>	 thcipriani: i see all scheduled patches (but the one from Lucas) are deployed already -- just a reminder that I'd like to do some deployments too :))
[20:58:21] <thcipriani>	 urbanecm: whoops, sorry, you're clear
[20:58:26] <urbanecm>	 thanks!
[20:58:28] <urbanecm>	 taking over
[21:01:08] <wikibugs>	 (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/773634
[21:01:19] <wikibugs>	 (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/773634 (owner: Urbanecm)
[21:02:06] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack)
[21:02:16] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/773634 (owner: Urbanecm)
[21:03:19] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 00m 50s)
[21:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:47] <wikibugs>	 (PS2) Urbanecm: fawiki: Set celebration logo for new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314)
[21:03:51] <wikibugs>	 (CR) Urbanecm: [C: +2] fawiki: Set celebration logo for new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) (owner: Urbanecm)
[21:04:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:04:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:11] <wikibugs>	 (Merged) jenkins-bot: fawiki: Set celebration logo for new vector [mediawiki-config] - https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) (owner: Urbanecm)
[21:05:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:05:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:04] <logmsgbot>	 !log thcipriani@deploy1002 Started deploy [releng/phatality@15f8ec0]: Deploying phatality updates for opensearch 1.2.0
[21:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:18] <logmsgbot>	 !log thcipriani@deploy1002 Finished deploy [releng/phatality@15f8ec0]: Deploying phatality updates for opensearch 1.2.0 (duration: 00m 13s)
[21:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:27] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-fawiki-new-year.png: 43385320f417052d8e60791b3cb970e6e3f088d5: fawiki: Set celebration logo for new vector (T304314; 1/2) (duration: 00m 50s)
[21:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:34] <stashbot>	 T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314
[21:09:09] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 43385320f417052d8e60791b3cb970e6e3f088d5: fawiki: Set celebration logo for new vector (T304314; 2/2) (duration: 00m 53s)
[21:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:20] <inflatador>	 !log bking@cumin1001 restarting blazegraph on wdqs[1003-1013].eqiad.wmnet for T293862
[21:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:26] <stashbot>	 T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862
[21:11:33] * urbanecm done
[21:11:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:12:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:49] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[21:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b...
[21:26:33] <wikibugs>	 SRE, envoy, serviceops: Better automated validation of Puppet-generated Envoy configs - https://phabricator.wikimedia.org/T304660 (RLazarus) p:Triage→Medium
[21:27:28] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:27:32] <wikibugs>	 (PS1) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644)
[21:28:49] <wikibugs>	 (PS2) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644)
[21:31:06] <wikibugs>	 (CR) Cwhite: Add marcusolsson-json-datasource (1 comment) [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog)
[21:31:59] <wikibugs>	 (PS1) RLazarus: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660)
[21:32:11] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:33:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[21:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:10] <wikibugs>	 (CR) Cwhite: Add marcusolsson-json-datasource (1 comment) [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog)
[21:35:24] <wikibugs>	 (PS3) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644)
[21:35:40] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:51] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:36:32] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34546/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[21:36:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[21:37:46] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:41] <wikibugs>	 (PS4) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644)
[21:41:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[21:41:24] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34547/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[21:42:01] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[21:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls...
[21:44:16] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:45:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23047 and previous config saved to /var/cache/conftool/dbconfig/20220324-214515-marostegui.json
[21:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:20] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[21:47:50] <wikibugs>	 (CR) RLazarus: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: RLazarus)
[21:49:59] <wikibugs>	 (PS5) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644)
[21:51:06] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34548/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[21:54:06] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[21:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b...
[21:54:20] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:55:07] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn)
[21:59:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[22:00:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23048 and previous config saved to /var/cache/conftool/dbconfig/20220324-220021-marostegui.json
[22:00:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:46] <wikibugs>	 (PS1) Dzahn: geoip::data::maxmind: deactivate timer for downloading of legacy DBs [puppet] - https://gerrit.wikimedia.org/r/773648 (https://phabricator.wikimedia.org/T303464)
[22:06:23] <wikibugs>	 (PS1) Dzahn: puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464)
[22:06:50] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[22:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn)
[22:07:25] <ebernhardson>	 !log restart wcqs-blazegraph on wcqs2001 to resolve intermittent BlazegraphFreeAllocatorsDecreasingRapidly
[22:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:42] <wikibugs>	 (CR) Razzi: [V: +1] "As you recommended @hashar I pulled the pruning into a new profile docker::prune; I'm not sure how to factor in the `if $::realm` chec" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi)
[22:07:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (cmooney) Ok so just to document we had an issue with the imaging of this, similar to the one in T303296.  I had disabled option 82 in...
[22:08:01] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Email spam from varying tawk.email addresses - https://phabricator.wikimedia.org/T304390 (Ladsgroup) If that website is a known spam source, simply add a global ban like: `.+\.tawk\.email$`
[22:10:14] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[22:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:59] <Lucas_WMDE>	 I just remembered I’d signed up for the evening backport window
[22:11:07] <Lucas_WMDE>	 sorry about that mutante thcipriani
[22:11:14] <Lucas_WMDE>	 the config change can wait for Monday, it’s not the end of the world
[22:12:30] <mutante>	 jouncebot: now
[22:12:30] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 47 minute(s)
[22:13:13] <mutante>	 Lucas_WMDE: I just wanted to try the "seen" feature :)
[22:13:18] <Lucas_WMDE>	 ^^
[22:13:39] <Lucas_WMDE>	 there’s also some command that would’ve sent me a message once I came back
[22:13:46] <Lucas_WMDE>	 I’ve never used it myself but someone did it to me not long ago ^^
[22:13:53] <Lucas_WMDE>	 might be @notify, not sure
[22:14:02] <mutante>	 ah, yea
[22:14:12] <mutante>	 a rarely used feature that is actually pretty cool
[22:14:28] <mutante>	 if you still want to deploy, just ask Tyler
[22:14:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host restbase2027.mgmt.codfw.wmnet with reboot policy FORCED
[22:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23049 and previous config saved to /var/cache/conftool/dbconfig/20220324-221526-marostegui.json
[22:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:25] <wikibugs>	 (CR) Hoo man: [C: +2] Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98)
[22:16:34] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn)
[22:17:30] <wikibugs>	 (Merged) jenkins-bot: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98)
[22:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[22:19:54] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls...
[22:23:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Papaul)
[22:24:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Papaul) Open→Resolved @Andrew this is complete ready for service.
[22:27:33] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Papaul) @Andrew Any other issues with 1016 and 1017 ? If no can we please close this task?  Thanks.
[22:30:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23050 and previous config saved to /var/cache/conftool/dbconfig/20220324-223031-marostegui.json
[22:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:37] <stashbot>	 T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658
[22:34:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:40:19] <wikibugs>	 (PS1) Papaul: Add restbase2027 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/773651 (https://phabricator.wikimedia.org/T301399)
[22:40:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (Papaul)
[23:04:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2027.mgmt.codfw.wmnet with reboot policy FORCED
[23:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (Dzahn) I could sponsor hexmode (@MarkAHershberger).
[23:05:04] <wikibugs>	 (CR) Papaul: [C: +2] Add restbase2027 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/773651 (https://phabricator.wikimedia.org/T301399) (owner: Papaul)
[23:07:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (thcipriani) >>! In T302287#7797522, @jbond wrote: >>>! In T302287#7797378, @KFrancis wrote: >> @jbond I am confirming the signed NDA.  Please proceed with the access request.  Thanks!...
[23:08:57] <wikibugs>	 (PS2) Dzahn: puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464)
[23:33:59] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:35:59] <wikibugs>	 (PS3) Krinkle: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[23:36:35] <wikibugs>	 (CR) Krinkle: [C: +1] "Good to go. As always, stage and verify on mwdebug1002 and confirm there are no errors or exceptions happening prior to syncing." [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[23:39:52] <wikibugs>	 (PS1) Ladsgroup: Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565)
[23:43:35] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) I'm unable to get the disk to go into missing to spin down, spin back up, and set to returned to test rebuilding an array.  I can set it to offline, and thats about it.  Also unable to determine ho...
[23:44:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (Dzahn) a:TomekSikora.Monsoon
[23:46:30] <wikibugs>	 SRE, Znuny, serviceops, Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (Dzahn) @Arnoldokoth Are you already aware of this change?
[23:57:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS buster
[23:57:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster