[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0000). [00:02:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [00:02:03] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [00:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [00:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [00:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:06] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [00:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:43] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: host reimage [00:09:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:46] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:20:18] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:26:08] PROBLEM - ensure kvm processes are running on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:27:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS bullseye [00:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:44] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bullseye [00:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:10] PROBLEM - ensure kvm processes are running on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:33:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1046.eqiad.wmnet with OS bullseye [00:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:28] RECOVERY - ensure kvm processes are running on cloudvirt1046 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:38:38] RECOVERY - ensure kvm processes are running on cloudvirt1044 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:51:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:08:31] SRE, MediaWiki-Stakeholders-Group, Traffic, Wikipedia-Android-App-Backlog, Performance-Team (Radar): RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (Renoirb) This has been closed? Has an equivalent idea started under a different name? [01:18:10] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:34:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [01:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [01:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [01:36:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [01:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:38:51] !log
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [01:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [01:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [01:44:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [01:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [01:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [01:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:55] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:05:22] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T128546) [02:10:53] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:11:26] (PS2) Jdrewniak: Bumping portals to master
[mediawiki-config] - https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) [02:13:05] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:05] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:07] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:10:05] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:49:16] (PS4) NguoiDungKhongDinhDanh: Fix I7ce58529cdd320a9500dc215291ef1c369cee9d3 [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) [05:14:29] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:09:50] (PS1) Razzi: karapace: remove Type=notify [puppet] - https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) [06:16:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:16:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:16:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on 12 hosts with reason: Maintenance [06:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:00] !log marostegui@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on 12 hosts with reason: Maintenance [06:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:37:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 129, down: 6, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:48:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23012 and previous config saved to /var/cache/conftool/dbconfig/20220324-064823-root.json [06:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:59] RECOVERY - puppet last run on ml-serve1001 is OK: OK: Puppet is currently disabled (elukey - cni testing), not alerting. Last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:37] (PS1) Elukey: install_server: update netboot settings for kubernetes nodes on Stretch [puppet] - https://gerrit.wikimedia.org/r/773389 (https://phabricator.wikimedia.org/T300744) [06:58:39] (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes1012 [puppet] - https://gerrit.wikimedia.org/r/773390 (https://phabricator.wikimedia.org/T300744) [06:59:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for testing', diff saved to https://phabricator.wikimedia.org/P23013 and previous config saved to /var/cache/conftool/dbconfig/20220324-065940-marostegui.json [06:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and apergos: May I have your attention please! UTC morning backport and config training.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0700) [07:00:17] (CR) Elukey: [C: +2] install_server: update netboot settings for kubernetes nodes on Stretch [puppet] - https://gerrit.wikimedia.org/r/773389 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [07:00:24] There are no trainees signed up and no patches scheduled in the window [07:00:38] maybe just as well since for some of us this is happening at 9 am :-D [07:00:51] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:02:03] checking --^ [07:03:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23014 and previous config saved to /var/cache/conftool/dbconfig/20220324-070327-root.json [07:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After testing', diff saved to https://phabricator.wikimedia.org/P23015 and previous config saved to /var/cache/conftool/dbconfig/20220324-070513-root.json [07:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:12] (PS1) Marostegui: db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773391 [07:07:53] (CR) Marostegui: [C: +2] db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/773391 (owner: Marostegui) [07:08:14] (CR) Elukey: [C: +2] Set bullseye + overlayfs settings for kubernetes1012 [puppet] - https://gerrit.wikimedia.org/r/773390 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [07:08:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1012.eqiad.wmnet with OS bullseye [07:09:01] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:37] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:17:58] (KubernetesCalicoDown) firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:18:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23016 and previous config saved to /var/cache/conftool/dbconfig/20220324-071832-root.json [07:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After testing', diff saved to https://phabricator.wikimedia.org/P23017 and previous config saved to /var/cache/conftool/dbconfig/20220324-072017-root.json [07:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage [07:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: host reimage [07:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:52] (CR) Tchanders: [C: +1] Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [07:33:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23018 and previous config saved to
/var/cache/conftool/dbconfig/20220324-073337-root.json [07:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After testing', diff saved to https://phabricator.wikimedia.org/P23019 and previous config saved to /var/cache/conftool/dbconfig/20220324-073520-root.json [07:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:11] (CR) Tchanders: [C: +1] Add IPInfo to BetaFeatures (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [07:39:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1012.eqiad.wmnet with OS bullseye [07:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:48] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:42:58] (KubernetesCalicoDown) resolved: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:44:59] (PS4) Majavah: kubeadm::helm: use systemd::environment [puppet] - https://gerrit.wikimedia.org/r/773274 [07:45:01] (PS4) Majavah: kubeadm::helm: configure default HELMFILE_ENVIRONMENT [puppet] - https://gerrit.wikimedia.org/r/773275 (https://phabricator.wikimedia.org/T304532) [07:45:03] (PS1) Majavah: kubeadm::helm: remove absented file [puppet] - https://gerrit.wikimedia.org/r/773438 [07:45:30] PROBLEM - Check systemd state on netflow6001 is CRITICAL: CRITICAL - degraded: The following units failed: sfacctd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23020 and previous config saved to /var/cache/conftool/dbconfig/20220324-074841-root.json [07:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After testing', diff saved to https://phabricator.wikimedia.org/P23021 and previous config saved to /var/cache/conftool/dbconfig/20220324-075024-root.json [07:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:45] RECOVERY - Check systemd state on netflow6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:42] (PS1) Marostegui: switchover-tmpl.sh: Add "Affected wikis" field [software] - https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605)
[07:58:27] (PS1) Majavah: Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 [08:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: Time to snap out of that daydream and deploy 🚂🧪Trainsperiment Week Deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T0800). [08:00:41] (PS1) Jcrespo: mediabackup: Update s4 backup in eqiad [puppet] - https://gerrit.wikimedia.org/r/773442 (https://phabricator.wikimedia.org/T299764) [08:03:44] (PS1) Elukey: Set bullseye + overlayfs for kubernetes1013 [puppet] - https://gerrit.wikimedia.org/r/773443 (https://phabricator.wikimedia.org/T300744) [08:05:05] (PS1) Jcrespo: Add new command line utility to update existing metadata [software/mediabackups] - https://gerrit.wikimedia.org/r/773444 (https://phabricator.wikimedia.org/T299764) [08:05:10] Puppet, SRE, Infrastructure-Foundations, Observability-Alerting: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (fgiunchedi) Open→Declined I think nowadays an host with no role will cause puppet to fail and therefore the reimage cookbook to fail to...
[08:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After testing', diff saved to https://phabricator.wikimedia.org/P23022 and previous config saved to /var/cache/conftool/dbconfig/20220324-080528-root.json [08:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:16] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:11:43] !log dbmaint s7@codfw T302658 [08:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:50] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [08:12:08] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1013 [puppet] - https://gerrit.wikimedia.org/r/773443 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [08:12:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1013.eqiad.wmnet with OS bullseye [08:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:11] (CR) Jcrespo: [C: +2] mediabackup: Update s4 backup in eqiad [puppet] - https://gerrit.wikimedia.org/r/773442 (https://phabricator.wikimedia.org/T299764) (owner: Jcrespo) [08:21:58] (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:23:52] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:27:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage [08:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:27] SRE, Traffic-Icebox: Multiple ATS HTTP2 stats missing from Prometheus -
https://phabricator.wikimedia.org/T292817 (fgiunchedi) - observability since there's no action ATM, feel free to retag when needed [08:29:32] (PS1) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [08:30:43] (CR) Giuseppe Lavagetto: Introduce requestctl (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [08:31:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: host reimage [08:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:58] (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:32:13] (PS2) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [08:32:28] (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:33:36] (PS3) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [08:33:43] (PS6) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471) [08:35:56] (PS4) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [08:36:16] (CR) Giuseppe Lavagetto: [C: +2] cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
(owner: Giuseppe Lavagetto) [08:36:27] !log depool cp1078 for reimage - T290005 [08:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:32] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:37:28] (PS3) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [08:37:28] (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:38:15] good morning [08:38:28] (KubernetesCalicoDown) firing: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:39:20] (Abandoned) Hashar: parser: Revert 2 media gallery changes [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773317 (https://phabricator.wikimedia.org/T304564) (owner: Brennen Bearnes) [08:39:26] SRE, Observability-Metrics: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (fgiunchedi) Open→Resolved a:fgiunchedi Fixed with Grafana 8 upgrade [08:39:58] jnuche: turns out wmf.4 got rolled back yesterday due to a parser issue ( https://gerrit.wikimedia.org/r/q/bug:T304564 ) [08:39:58] T304564: MWException: `[title]` is not a valid file title.
- https://phabricator.wikimedia.org/T304564 [08:40:43] I am wondering whether we should move wmf.4 forward today or just abandon it :D [08:41:31] I guess I will do the backport [08:41:33] revisit the log [08:41:37] and move wmf.4 forward again [08:41:40] (CR) MMandere: [C: +2] site: Reimage cp1078 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773198 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [08:42:04] (PS5) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [08:42:13] (KubernetesCalicoDown) resolved: kubernetes1013.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:43:06] (PS1) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773450 [08:43:11] hashar: let me try my hand at the backport [08:43:20] (PS1) Hashar: Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) [08:43:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1013.eqiad.wmnet with OS bullseye [08:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:26] !log oblivian@puppetmaster1001 conftool action : set/enabled=true; selector: name=parameter_q,cluster=cache-text [08:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:15] oh there is another blocker https://phabricator.wikimedia.org/T304559 :-\ [08:44:39] !log dbmaint s7@eqiad T302658 [08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:45] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [08:45:11] !log
oblivian@puppetmaster1001 conftool action : set/enabled=false; selector: name=parameter_q,cluster=cache-text [08:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:47] (CR) Abbe98: "Hi! Adding you as a reviewer because you have made similar patches in the past." [dumps/dcat] - https://gerrit.wikimedia.org/r/773450 (owner: Abbe98) [08:48:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS buster [08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:37] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1078.eqiad.wmnet with OS buster [08:55:16] (CR) Jaime Nuche: [C: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar) [08:55:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:00:57] !log oblivian@puppetmaster1001 conftool action : set/enabled=true; selector: name=parameter_q,cluster=cache-text [09:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] (CR) Jaime Nuche: [V: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar) [09:05:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage
[09:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:46] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:26] (CR) Jaime Nuche: [V: +2 C: +2] Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321 (https://phabricator.wikimedia.org/T304564) (owner: Hashar) [09:08:06] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for imagecatalog [puppet] - https://gerrit.wikimedia.org/r/773205 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:08:08] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:08:37] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for uwsgi/graphite-web [puppet] - https://gerrit.wikimedia.org/r/773190 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:08:56] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage [09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:42] (PS1) Giuseppe Lavagetto: cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) [09:18:04] (PS6) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [09:18:24] (PS1) Muehlenhoff: Extend edges alias to also include drmrs now that the site is live [puppet] - https://gerrit.wikimedia.org/r/773452 [09:18:31] (Merged) jenkins-bot: Broken media in galleries might not have the file namespace [core] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/773321
(https://phabricator.wikimedia.org/T304564) (owner: Hashar) [09:20:38] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:20:45] (JobUnavailable) resolved: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:20:58] (PS2) Giuseppe Lavagetto: cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) [09:21:00] (PS1) Giuseppe Lavagetto: varnish::frontend: rmeove temporary rate-limits [puppet] - https://gerrit.wikimedia.org/r/773454 [09:21:02] (PS1) Giuseppe Lavagetto: varnish::frontend: remove normalization for parameter [puppet] - https://gerrit.wikimedia.org/r/773455 [09:26:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:07] !log jnuche@deploy1002 Synchronized php-1.39.0-wmf.4/includes/Linker.php: (no justification provided) (duration: 00m 50s) [09:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:28:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:22] !log mmandere@cumin1001
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1078.eqiad.wmnet with OS buster [09:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1078.eqiad.wmnet with OS buster com... [09:31:37] !log pool cp1078 with HAProxy as TLS termination layer - T290005 [09:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:42] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:32:55] (PS1) Phedenskog: Add marcusolsson-json-datasource [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) [09:40:23] (CR) David Caro: "Mostly questions, any nits can be ignored" [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah) [09:46:47] (PS7) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [09:47:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet [09:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:06] (PS8) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 [09:51:12] (CR) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (3 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah) [09:53:28] (PS1) Ayounsi: Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) [09:56:11] !log btullis@cumin1001 END (PASS) 
- Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet [09:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1000). [10:01:39] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet [10:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:53] (PS1) Elukey: Set bullseye + overlayfs for kubernetes1014 [puppet] - https://gerrit.wikimedia.org/r/773466 (https://phabricator.wikimedia.org/T300744) [10:06:55] (PS1) Elukey: Set bullseye + overlayfs for kubernetes1017 [puppet] - https://gerrit.wikimedia.org/r/773467 (https://phabricator.wikimedia.org/T302208) [10:08:51] (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi) [10:09:14] (CR) Elukey: [C: +1] decommission kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) (owner: Alexandros Kosiaris) [10:09:25] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1014 [puppet] - https://gerrit.wikimedia.org/r/773466 (https://phabricator.wikimedia.org/T300744) (owner: Elukey) [10:09:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1014.eqiad.wmnet with OS bullseye [10:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet [10:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:24] (CR) Ayounsi: "Example diff for lsw1-e2:" [homer/public] - 
https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi) [10:17:53] (CR) Ayounsi: [C: +2] Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi) [10:18:05] (PS34) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:18:31] (Merged) jenkins-bot: Add sflow support to prod l3 switches [homer/public] - https://gerrit.wikimedia.org/r/773458 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi) [10:19:13] ops-eqiad, DC-Ops, Traffic: cp1090.mgmt ssh port not accessible - https://phabricator.wikimedia.org/T304589 (MMandere) p:Triage→Medium [10:19:58] (KubernetesCalicoDown) firing: kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:20:31] !log depool cp1076 for reimage - T290005 [10:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:36] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:22:03] (PS1) Ayounsi: Add eqiad EVPN overlay loopbacks to network::infrastructure [puppet] - https://gerrit.wikimedia.org/r/773468 (https://phabricator.wikimedia.org/T263277) [10:23:38] (CR) MMandere: [C: +2] site: Reimage cp1076 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [10:24:33] (PS3) MMandere: site: Reimage cp1076 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773199 (https://phabricator.wikimedia.org/T290005) [10:25:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
kubernetes1014.eqiad.wmnet with reason: host reimage [10:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:05] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [10:26:16] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet [10:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:14] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1076.eqiad.wmnet with OS buster [10:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:24] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1076.eqiad.wmnet with OS buster [10:28:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: host reimage [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:27] (CR) Arturo Borrero Gonzalez: "thanks for working on this! some comments inline." 
[puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah) [10:31:09] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [10:31:29] (PS1) Ayounsi: Add static route leak for sflow collector in EVPN setup [homer/public] - https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277) [10:32:14] (CR) Arturo Borrero Gonzalez: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah) [10:32:45] (CR) Arturo Borrero Gonzalez: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (owner: Majavah) [10:33:31] (PS3) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) [10:33:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet [10:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet [10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:52] (CR) JMeybohm: Add helm charts and a helmfile configuration for datahub (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) 
[10:37:12] (PS4) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) [10:39:58] (KubernetesCalicoDown) resolved: kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:40:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1014.eqiad.wmnet with OS bullseye [10:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet [10:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet [10:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:46] (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1017 [puppet] - https://gerrit.wikimedia.org/r/773467 (https://phabricator.wikimedia.org/T302208) (owner: Elukey) [10:43:50] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage [10:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1017.eqiad.wmnet with OS bullseye [10:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:44] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage [10:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host an-worker1100.eqiad.wmnet [10:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [10:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:46] (PS5) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) [10:54:28] (KubernetesCalicoDown) firing: (2) kubernetes1014.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:56:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet [10:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:02] 1014 should not be alarming, checking [11:00:13] (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:00:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage [11:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:01] weird, the calico pod on 1014 is up [11:02:21] and I don't see the alert anymore in alerts.w.o, maybe it is going to auto-solve [11:04:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1017.eqiad.wmnet with reason: host reimage [11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:28] (KubernetesCalicoDown) firing: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:09:28] (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:10:04] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1076.eqiad.wmnet with OS buster [11:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:13] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1076.eqiad.wmnet with OS buster com... [11:10:28] (KubernetesCalicoDown) firing: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:11:38] (CR) David Caro: [C: +2] "This is meant to be merged after a puppet run has gone through with just the previous patches right?" [puppet] - https://gerrit.wikimedia.org/r/773438 (owner: Majavah) [11:12:14] (CR) Majavah: kubeadm::helm: remove absented file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773438 (owner: Majavah) [11:14:06] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34530/console" [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto) [11:14:26] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
[11:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:36] (CR) Giuseppe Lavagetto: [C: +1] "Gotta love how deprecations happen within the same major api version." [puppet] - https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: RLazarus) [11:15:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1017.eqiad.wmnet with OS bullseye [11:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:33] (PS1) Elukey: sre.kafka.roll-restart-brokers: generalize the restart reason [cookbooks] - https://gerrit.wikimedia.org/r/773475 [11:16:57] (CR) Muehlenhoff: [C: +1] "Looks good to me, one typo inline." [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond) [11:18:45] (PS19) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:19:26] (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:20:13] (KubernetesCalicoDown) resolved: kubernetes1017.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:21:39] !log removing old api.svc.codfw.wmnet.pem and appservers.svc.codfw.wmnet.pem from root@puppetmaster1001:/var/lib/puppet/server/ssl/ca/signed# [11:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:01] (PS6) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) [11:22:19] (CR) Giuseppe Lavagetto: 
deployment_server: add mediawiki on k8s releases repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto) [11:22:48] !log puppet cert clean rendering.svc.eqiad.wmnet [11:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:24:18] (CR) Giuseppe Lavagetto: [C: +2] deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: Giuseppe Lavagetto) [11:26:45] !log pool cp1076 with HAProxy as TLS termination layer - T290005 [11:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:50] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:28:26] (PS35) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [11:28:33] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [11:38:12] (PS1) Muehlenhoff: profile::java: Also add component/jdk on bullseye [puppet] - https://gerrit.wikimedia.org/r/773476 [11:39:04] (PS2) Daniel Kinzler: Set MW_USE_CONFIG_SCHEMA constant if file exists. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) [11:41:23] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [11:41:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff) [11:42:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:45:49] (CR) Zabe: [C: +1] Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 (owner: Majavah) [11:46:13] (CR) Hoo man: [C: -1] "Thanks for looking into this, this should indeed be changed." 
[dumps/dcat] - https://gerrit.wikimedia.org/r/773450 (owner: Abbe98) [11:46:19] (PS2) Majavah: kubeadm::helm: remove absented file [puppet] - https://gerrit.wikimedia.org/r/773438 [11:46:21] (PS9) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh [puppet] - https://gerrit.wikimedia.org/r/773448 (https://phabricator.wikimedia.org/T303931) [11:47:06] (CR) Majavah: toolforge: deploy ingress-nginx via helmfile and provide deploy.sh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/773448 (https://phabricator.wikimedia.org/T303931) (owner: Majavah) [11:47:25] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:47:41] !log updating eqiad swift-commonswiki backups of originals T299764 [11:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:49] T299764: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 [11:54:02] (PS1) Majavah: P:cache::varnish::frontend: fix duplicate resource declarations [puppet] - https://gerrit.wikimedia.org/r/773477 [11:55:48] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34531/console" [puppet] - https://gerrit.wikimedia.org/r/773477 (owner: Majavah) [11:58:34] (CR) Majavah: "broke puppet on deployment-prep, fix is Icf78fb25cf7594ad1dc3dda72b5a09eddd018481" [puppet] - https://gerrit.wikimedia.org/r/772401 (owner: Giuseppe Lavagetto) [11:58:52] (CR) Jbond: "lgtm" [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [12:01:54] (CR) JMeybohm: [C: -1] Add helm charts and a helmfile configuration for datahub (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 
(https://phabricator.wikimedia.org/T301454) (owner: Btullis) [12:05:11] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (JMeybohm) >>! In T304237#7790839, @Volans wrote: > ` > root@puppetmaster1001:~# for file in $(ls /var/lib/puppet/server/ssl/ca/signe... [12:07:37] (PS7) Jbond: P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [12:07:57] (CR) Jbond: "thans updated" [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond) [12:08:04] (PS8) Jbond: P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) [12:11:30] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond) [12:11:32] (CR) Jbond: [C: +2] P:environment: Add ablilty to inject environment variables [puppet] - https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: Jbond) [12:16:25] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff) [12:17:27] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:56] (PS1) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 [12:25:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:30:12] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - 
https://phabricator.wikimedia.org/T304546 (jbond) p:Triage→Medium [12:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:34:37] (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931) [12:35:49] (CR) Majavah: [C: +1] wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931) (owner: Arturo Borrero Gonzalez) [12:36:26] (Abandoned) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773450 (owner: Abbe98) [12:37:52] SRE, Data-Engineering: Adding snwachukwu@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T304541 (jbond) Open→Resolved a:jbond This has been completed [12:38:14] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (jbond) p:Triage→Medium [12:38:24] (CR) Hoo man: [C: -1] "One nitpick, look's fine otherwise." 
[dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98) [12:38:30] (CR) Muehlenhoff: [C: +2] profile::java: Also add component/jdk on bullseye [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff) [12:38:55] (CR) Muehlenhoff: profile::java: Also add component/jdk on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773476 (owner: Muehlenhoff) [12:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:43:29] (PS20) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [12:43:42] (CR) CDanis: "looks good enough to me just some nits" [software/conftool] - https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [12:44:14] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: k8s: default to deploy.sh as deployment command [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/773491 (https://phabricator.wikimedia.org/T303931) (owner: Arturo Borrero Gonzalez) [12:52:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1158 for schema change', diff saved to https://phabricator.wikimedia.org/P23023 and previous config saved to /var/cache/conftool/dbconfig/20220324-125225-marostegui.json [12:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:52] (CR) Jforrester: [C: +1] "This is fine to go; any comment adjustment can be made later." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [12:54:14] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2001-dev.codfw.wmnet with OS bullseye [12:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] (PS2) MSantos: maps: allow bbcrewind to access maps public urls [puppet] - https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968) [12:56:01] (PS2) Tchanders: Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [12:56:15] (CR) Tchanders: [C: +1] Add IPInfo to BetaFeatures (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1300). [13:00:05] zabe and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:02:01] Hi! I'm around to test if anyone is around to deploy? 
[13:03:14] o/ [13:04:09] * Reedy looks [13:04:57] (CR) Reedy: [C: +2] Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [13:05:52] (Merged) jenkins-bot: Add IPInfo to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/773348 (https://phabricator.wikimedia.org/T292802) (owner: STran) [13:05:59] (CR) JMeybohm: [C: +1] "Don't know if the type is overkill, so +1 with comment :)" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [13:07:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:07:14] Tchanders: It's on mwdebug1001 [13:07:23] (PS3) Reedy: Stop writing to $wmfDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:07:29] Reedy: Taking a look - thanks [13:08:11] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34533/netflow1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/773468 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi) [13:09:14] (PS2) Abbe98: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 [13:09:15] Reedy: Looks great [13:09:22] sweet [13:09:56] (CR) Reedy: [C: +2] Stop writing to $wmfDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:10:26] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T292802 
(duration: 00m 50s) [13:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] T292802: IP Info feature should be made available as a Beta feature for launch [M] - https://phabricator.wikimedia.org/T292802 [13:10:38] (Merged) jenkins-bot: Stop writing to $wmfDatacenter(s) [mediawiki-config] - https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:11:20] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [13:11:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:44] zabe: Yours is on mwdebug1001 too... As far as we can test it ;D [13:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:12:18] (I also double checked for usages of the wmf global) [13:12:52] (CR) JMeybohm: [C: -1] Initial debianization of istio-cni (3 comments) [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [13:12:57] Reedy, nothing seems to break and logstash looks clear, so I would say we are good to go [13:14:02] (CR) Abbe98: "Indentation fixed." 
[dumps/dcat] - 10https://gerrit.wikimedia.org/r/773490 (owner: 10Abbe98) [13:15:13] !log reedy@deploy1002 Synchronized tests/: T45956 (duration: 00m 49s) [13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:15:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:15:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:37] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:52] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [13:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [13:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23024 and previous config saved to /var/cache/conftool/dbconfig/20220324-132217-root.json [13:22:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:22:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:45] !log reedy@deploy1002 Synchronized multiversion/: T45956 (duration: 00m 50s) [13:23:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:49] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:02] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T45956 (duration: 00m 49s) [13:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:33:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:34:08] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw2001-dev.codfw.wmnet with OS bullseye [13:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] (03PS3) 10JMeybohm: Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) [13:37:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: After schema change', diff saved to 
https://phabricator.wikimedia.org/P23025 and previous config saved to /var/cache/conftool/dbconfig/20220324-133721-root.json [13:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:53] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH >>! In T297913#7801355, @RobH wrote: > > So I guess this kernel change broke it entirely? No, you were using the wrong command :-) "perccli" is a... [13:42:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10cmooney) Just to update here. No solution as of yet, Juniper are also of the belief it is a bug in how their software processes ARPs, and the interaction be... [13:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:43:20] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2002-dev.codfw.wmnet with OS bullseye [13:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:58] (03PS1) 10Ssingh: dnsdist: remove redundant rate limits (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/773503 [13:47:24] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10MatthewVernon) Thanks for the update, and I'm glad some progress is being made :) From my POV, I don't need this hardware just now; so happy with it staying... 
[13:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:50:43] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:51:21] uh [13:52:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23026 and previous config saved to /var/cache/conftool/dbconfig/20220324-135225-root.json [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:57:28] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [13:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:24] (03PS5) 10Elukey: Initial debianization of istio-cni [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) [13:59:08] (03CR) 10Elukey: Initial debianization of istio-cni (033 comments) [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [14:00:57] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [14:01:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:02:14] (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:03:25] (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:03:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, also what Riccardo said" [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:04:40] (03CR) 10Filippo Giunchedi: C:icinga::commons: Add ssl expiry checks for gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:05:44] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize deploy code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773509 [14:05:46] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class So we can easily reuse it easily from different cookbooks. 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 [14:05:49] (03CR) 10Filippo Giunchedi: [C: 03+1] C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:07:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23028 and previous config saved to /var/cache/conftool/dbconfig/20220324-140729-root.json [14:07:32] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34535/console" [puppet] - 10https://gerrit.wikimedia.org/r/773503 (owner: 10Ssingh) [14:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:42] (03CR) 10Ayounsi: [C: 03+2] "Example diff on lsw1-f2:" [homer/public] - 10https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [14:08:24] (03Merged) 10jenkins-bot: Add static route leak for sflow collector in EVPN setup [homer/public] - 10https://gerrit.wikimedia.org/r/773470 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [14:09:33] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 [14:11:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:icinga::monitor::cloudelastic: refactor to make a bit more DRY [puppet] - 10https://gerrit.wikimedia.org/r/773213 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:11:34] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS bullseye [14:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:13:14] (03PS36) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [14:13:39] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: remove redundant rate limits (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/773503 (owner: 10Ssingh) [14:14:39] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:18:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:18:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [14:19:21] (03CR) 10Jforrester: "Aha, yes, the .com is the primary for that sub-domain. 
I don't know if that's OK for all sub-domains, but we did that for wikimedia.org, s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) (owner: 10Arlolra) [14:20:29] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [14:21:04] (03CR) 10Filippo Giunchedi: "I'm sorry I currently don't have the bandwidth to take this on (+Matthew as he might)" [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [14:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23029 and previous config saved to /var/cache/conftool/dbconfig/20220324-142233-root.json [14:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] backup1001: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/bacula/jobs.d/deploy2002.codfw.wmnet-/etc/helmfile-defaults/mediawiki/release-Monthly-1st-Thu-production.conf],File[/etc/bacula/jobs.d/deploy1002.eqiad.wmnet-/etc/helmfile-defaults/mediawiki/release-Monthly-1st-Tue-production.conf] [14:23:26] someone working with deploy servers? [14:24:02] (03PS1) 10Tchanders: Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) [14:26:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [14:26:23] PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:41] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [14:27:13] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:46] (03CR) 10Filippo Giunchedi: "LGTM, Cole what do you think ?" [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: 10Phedenskog) [14:29:22] (03CR) 10David Caro: "Why not expose it as a cookbook instead?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 (owner: 10Arturo Borrero Gonzalez) [14:30:52] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:31:35] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:31:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:31:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23030 and previous config saved to /var/cache/conftool/dbconfig/20220324-143149-marostegui.json [14:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [14:34:00] (JobUnavailable) firing: Reduced availability 
for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:39] !log installing containerd updates on ml-serve* [14:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773218 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:42:52] (03PS3) 10Jbond: C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) [14:43:05] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:43:36] (03CR) 10Filippo Giunchedi: [C: 04-1] "Not opposed in theory, though given how critical (hah!) 
check_http is we must make sure we get some form of testing for the script going" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:48:20] (03PS1) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) [14:49:11] (03CR) 10Jbond: [C: 03+2] C:icinga::monitor::cloudelastic: Add checkes for certificate expiry [puppet] - 10https://gerrit.wikimedia.org/r/773214 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:56:33] (03PS2) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) [14:58:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34538/console" [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:59:21] (03CR) 10Jbond: [C: 03+2] C:icinga::gitlab: Add ssl expiry checks for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/773220 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:59:32] (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:59:58] (03PS2) 10Jbond: C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) [15:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:00:15] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:00:53] (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for commons [puppet] - 10https://gerrit.wikimedia.org/r/773217 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:00:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:01:38] (03PS3) 10Elukey: kubernetes: clean up extra netboot and host settings [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) [15:01:40] (03PS2) 10Jbond: icinga: move client_auth_puppet_post to use wmf_check_http [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321) [15:01:55] (03CR) 10Jbond: [C: 03+2] P:docker_registry_ha::registry: Add ssl expiry checks [puppet] - 10https://gerrit.wikimedia.org/r/773257 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:02:39] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773254 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:03:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34539/console" [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:03:05] (03CR) 10Jbond: [C: 03+2] P:chartmuseum: Add ssl expiry checks for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/773249 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:03:28] (03CR) 10Jbond: [C: 03+2] C:noc: Add ssl expiry checks for noc [puppet] - 10https://gerrit.wikimedia.org/r/773223 
(https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:03:50] !log installing openssl1.0 security updates on stretch [15:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:06:03] (03PS2) 10Jbond: C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - 10https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) [15:06:19] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10Patch-For-Review: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) @MatthewVernon Filippo said he doesn't have the bandwidth to help with the patch and recommended contacting you. Could you h... [15:09:46] (03CR) 10Jbond: [C: 03+2] C:lvs::monitor_services: Add ssl expiry checks for lvs [puppet] - 10https://gerrit.wikimedia.org/r/773221 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:10:11] (03CR) 10David Caro: [C: 03+1] C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - 10https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:10:26] (03CR) 10Herron: "LGTM overall!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [15:10:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 30): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34541/console" [puppet] - 10https://gerrit.wikimedia.org/r/773520 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:12:08] (03CR) 10Jbond: [C: 03+2] C:openstack::keystone: Add ssl expiry checks for keystone [puppet] - 10https://gerrit.wikimedia.org/r/773224 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:12:43] (03CR) 10Jbond: [C: 03+2] C:icinga::commons: Add ssl expiry checks for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/773219 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:13:48] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:00] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:18:19] (03CR) 10Cwhite: [C: 03+2] profile: Rsyslog omkafka configs use new ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/773285 (https://phabricator.wikimedia.org/T291905) (owner: 10Cwhite) [15:19:52] (03PS1) 10Jbond: P:icinga: Add ssl expiry check to external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) [15:21:04] (03PS2) 10MMandere: site: Reimage cp2033 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) [15:21:06] (03PS2) 10MMandere: site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) [15:21:08] (03PS2) 10MMandere: site: Reimage cp2029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) [15:21:10] (03PS2) 10MMandere: site: Reimage 
cp2027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773253 (https://phabricator.wikimedia.org/T290005) [15:21:12] (03PS1) 10MMandere: site: Reimage cp2034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005) [15:21:14] (03PS1) 10MMandere: site: Reimage cp2032 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005) [15:21:16] (03PS1) 10MMandere: site: Reimage cp2030 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005) [15:21:18] (03PS1) 10MMandere: site: Reimage cp2028 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005) [15:21:43] (03PS1) 10Cmjohnson: Adding ml-cache1001-3 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/773558 (https://phabricator.wikimedia.org/T299435) [15:23:04] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:41] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) [15:24:31] !log codfw: disable BGP to DE-CIX for link move [15:24:32] (03PS1) 10Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - 10https://gerrit.wikimedia.org/r/773559 [15:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:28] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2034 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773554 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:25:59] (03PS2) 10Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - 10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) [15:26:01] (03CR) 10Ssingh: [C: 03+1] site: 
Reimage cp2032 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:26:32] RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:26:32] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2030 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:27:02] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp2028 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [15:28:15] (03CR) 10Jcrespo: "I am guessing this is WIP code- so a quick comment will be the easiest to go until a more permanent solutions is available? This blocks co" [puppet] - 10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: 10Jcrespo) [15:28:47] (03PS1) 10Jbond: P:idp::client::https::site: Add check_http_expiry to idp services [puppet] - 10https://gerrit.wikimedia.org/r/773560 (https://phabricator.wikimedia.org/T304321) [15:28:49] (03CR) 10Cmjohnson: [C: 03+2] Adding ml-cache1001-3 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/773558 (https://phabricator.wikimedia.org/T299435) (owner: 10Cmjohnson) [15:29:29] ^joe can I get a patch review?
[15:30:07] (03PS1) 10Ebernhardson: elastic: Remove noqa from rolling-operation.py [cookbooks] - 10https://gerrit.wikimedia.org/r/773561 [15:30:09] (03PS1) 10Ebernhardson: elastic: Bring back stopping new replicas during restart [cookbooks] - 10https://gerrit.wikimedia.org/r/773562 [15:31:25] (03PS1) 10Jbond: P:librenms::web: add check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773563 (https://phabricator.wikimedia.org/T304321) [15:32:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) [15:32:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backy2: add link to the runbook for backup_vms [puppet] - 10https://gerrit.wikimedia.org/r/772839 (https://phabricator.wikimedia.org/T304408) (owner: 10David Caro) [15:32:49] (03PS1) 10Muehlenhoff: Enable Ganeti 3 for ganeti-test* [puppet] - 10https://gerrit.wikimedia.org/r/773564 [15:33:02] (03PS1) 10Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) [15:33:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:23] (03PS1) 10Jbond: P:lists::monitoring: Add check_https_expiry check [puppet] - 10https://gerrit.wikimedia.org/r/773566 (https://phabricator.wikimedia.org/T304321) [15:34:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) [15:35:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:35:31] (03CR) 10Jcrespo: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [15:37:32] (03PS1) 10Jbond: P:microsites::peopleweb: add check_http_expiry monitor [puppet] - 10https://gerrit.wikimedia.org/r/773567 (https://phabricator.wikimedia.org/T304321) [15:38:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:39:25] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10UploadWizard, 10Tracking-Neverending: Uploadstash errors (tracking) - https://phabricator.wikimedia.org/T85568 (10Krinkle) [15:39:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS bullseye [15:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet wit... 
[15:39:59] (03PS4) 10Elukey: role::kafka::logging: add PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) [15:40:00] (03PS1) 10Jbond: P:netbox: add check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773568 (https://phabricator.wikimedia.org/T304321) [15:40:08] (03CR) 10Elukey: role::kafka::logging: add PKI migration settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:41:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1142.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:44:28] (03PS2) 10Arlolra: Add wikimedia.com to wgNoFollowDomainExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773302 (https://phabricator.wikimedia.org/T304555) [15:45:35] (03PS1) 10Jbond: P:phabricator: add check_expiry for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/773571 (https://phabricator.wikimedia.org/T304321) [15:49:16] (03PS1) 10Jbond: P:icinga::debmonitor: correct check definition [puppet] - 10https://gerrit.wikimedia.org/r/773573 (https://phabricator.wikimedia.org/T304321) [15:49:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:icinga::debmonitor: correct check definition [puppet] - 10https://gerrit.wikimedia.org/r/773573 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [15:51:30] (03PS3) 10Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - 
10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) [15:51:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [15:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:51] 10SRE, 10Data-Engineering, 10Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (10elukey) Adding some context for the Traffic team. There were two varnishkafka versions, one in the `main` component and one in `component/varnish6` of `buster-wikimedia` at the time... [15:56:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [15:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:28] 10SRE, 10Data-Engineering, 10Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (10BBlack) Thanks for making this ticket and adding those insights! I agree, there have been multiple times in the past that we've had problems in this area, and we should probably pup... [16:00:04] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:02:17] (03CR) 10JMeybohm: bacula: Unbreak director: disable deployment backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: 10Jcrespo) [16:03:57] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:07:15] (03CR) 10Razzi: "Am I understanding Type=notify correctly? See commit message" [puppet] - 10https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [16:07:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1001.eqiad.wmnet with OS bullseye [16:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:19] (03CR) 10Jcrespo: bacula: Unbreak director: disable deployment backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: 10Jcrespo) [16:07:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS... 
[16:07:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1142.mgmt.eqiad.wmnet with reboot policy FORCED [16:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [16:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet wit... [16:09:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:04] jouncebot nowandnext [16:12:04] For the next 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600) [16:12:04] In 1 hour(s) and 47 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800) [16:12:37] (03CR) 10BBlack: [C: 04-1] "Looking pretty good overall, a couple of comments inline here (maybe remove the TODO part entirely too, if you agree). 
We should definite" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [16:12:53] current window appears clear, train's unblocked, we're going ahead to all wikis with wmf.4 [16:13:27] !log trainsperiment (T300203): blockers clear, logs triaged, rolling 1.39.0-wmf.4 out to all wikis again [16:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:33] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [16:15:48] (03PS3) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) [16:16:27] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773579 [16:16:28] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773579 (owner: 10Brennen Bearnes) [16:16:34] (03CR) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [16:17:12] (03PS1) 10Jbond: P:openstack: use correct vhost when checking sl expiry [puppet] - 10https://gerrit.wikimedia.org/r/773580 [16:17:42] (03CR) 10Jbond: [C: 03+2] P:openstack: use correct vhost when checking sl expiry [puppet] - 10https://gerrit.wikimedia.org/r/773580 (owner: 10Jbond) [16:18:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1003.eqiad.wmnet with OS bullseye [16:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) 
rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye [16:18:50] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773579 (owner: 10Brennen Bearnes) [16:19:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:28] 10SRE, 10Observability-Metrics, 10observability, 10Graphite: PHP statsd client doesn't support tagging metrics - https://phabricator.wikimedia.org/T225721 (10fgiunchedi) 05Open→03Declined I believe with {T240685} in mediawiki (i.e. Prometheus / generic tags support) this can be declined (though feel fr... 
[16:19:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:33] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.4 refs T300203 [16:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:39] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [16:20:51] (03PS4) 10Krinkle: Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:21:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [16:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:22] (03PS1) 10Giuseppe Lavagetto: backup: fix filesets definition for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) [16:21:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [16:21:36] (03PS1) 10Brennen Bearnes: group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773582 [16:21:38] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773582 (owner: 10Brennen Bearnes) [16:21:40] (03CR) 10Krinkle: [C: 03+1] "TestServices.php and TestServices.php still set wmfMasterServices, that one and other can be removed from there as well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:22:43] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773582 (owner: 10Brennen Bearnes) [16:23:33] 10SRE, 10Observability-Metrics: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10fgiunchedi) 05Open→03Declined Boldly declining this since graphite is in life support mode and the lowest hanging fruits have been addressed (thanks!) [16:23:55] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:24:17] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.4 refs T300203 [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [16:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:25] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.4 refs T300203 (duration: 01m 06s) [16:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:38] (03CR) 10Filippo Giunchedi: "LGTM, see inline for a nit" [puppet] - 10https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:25:45] (03CR) 10Jcrespo: "Looks fine, I only have one question- general_dir is not currently defined? 
Could it in the future be different between, eg. eqiad and cod" [puppet] - 10https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [16:26:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:26:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:28] (03PS2) 10Jbond: P:icinga: Add ssl expiry check to external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) [16:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:46] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:26:50] (03Abandoned) 10Jcrespo: bacula: Unbreak director: disable deployment backups [puppet] - 10https://gerrit.wikimedia.org/r/773559 (https://phabricator.wikimedia.org/T299648) (owner: 10Jcrespo) [16:27:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:29] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773583 [16:27:31] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773583 (owner: 10Brennen Bearnes) [16:28:07] (03CR) 10Filippo Giunchedi: [C: 03+1] P:icinga: Add ssl expiry check to external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/773553 
(https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:28:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] backup: fix filesets definition for mw on k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [16:28:21] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.4 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773583 (owner: 10Brennen Bearnes) [16:29:53] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.4 refs T300203 [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:59] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [16:30:25] (03CR) 10Jcrespo: [C: 03+1] backup: fix filesets definition for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/773581 (https://phabricator.wikimedia.org/T299648) (owner: 10Giuseppe Lavagetto) [16:30:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1145.mgmt.eqiad.wmnet with reboot policy FORCED [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:33:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [16:34:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:20] (03CR) 10Jbond: [C: 03+2] P:icinga: Add ssl expiry check to external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/773553 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:34:22] (03CR) 10Elukey: [C: 03+2] role::kafka::logging: add PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [16:34:49] jbond: ok to merge? [16:35:03] (03CR) 10BBlack: [C: 03+2] map Spain to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773244 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [16:35:06] please do elukey [16:35:23] done! 
[16:35:26] thx [16:35:34] (03PS1) 10Arturo Borrero Gonzalez: keepalived: use version from bullseye-bpo [puppet] - 10https://gerrit.wikimedia.org/r/773585 (https://phabricator.wikimedia.org/T304598) [16:35:36] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't install kernel or nft from backports [puppet] - 10https://gerrit.wikimedia.org/r/773586 (https://phabricator.wikimedia.org/T304598) [16:35:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS bullseye [16:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: -... [16:36:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED [16:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:40] 10SRE, 10Observability-Logging, 10Privacy Engineering, 10Wikimedia-Logstash, and 2 others: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving in favor of {T246998} since that'll... 
[16:38:11] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:38:22] 10SRE, 10Observability-Logging: Monitor the BMC's event log for hardware errors - https://phabricator.wikimedia.org/T136311 (10fgiunchedi) Mentioning {T302639} here too since the two are related [16:38:46] (03PS1) 10Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) [16:39:15] (03CR) 10Jbond: [C: 03+2] P:phabricator: add check_expiry for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/773571 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:39:35] (03CR) 10Jbond: [C: 03+2] P:netbox: add check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773568 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:39:59] (03CR) 10Jbond: [C: 03+2] P:microsites::peopleweb: add check_http_expiry monitor [puppet] - 10https://gerrit.wikimedia.org/r/773567 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:40:18] (03CR) 10Jbond: [C: 03+2] P:lists::monitoring: Add check_https_expiry check [puppet] - 10https://gerrit.wikimedia.org/r/773566 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:40:37] (03CR) 10Jbond: [C: 03+2] P:librenms::web: add check_https_expiry [puppet] - 10https://gerrit.wikimedia.org/r/773563 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [16:41:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/773586 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [16:42:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1146.mgmt.eqiad.wmnet with reboot policy FORCED [16:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:08] 10SRE, 10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10MoritzMuehlenhoff) Since we've replaced Kibana with Opensearch Dashboards we now actually can use OIDC or SAML it seems: https://opensearch.org/docs/latest/security-plugin/configuration/openid-connect/ https://... [16:44:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1003.eqiad.wmnet with OS bullseye [16:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: -... [16:46:32] 10SRE, 10Machine-Learning-Team, 10Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10colewhite) @elukey This issue hasn't reappeared since we began dropping the field. If you're ok with keeping this mitigation in place, please feel free to c... 
[16:47:26] (03CR) 10JMeybohm: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:47:56] (03CR) 10JMeybohm: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:49:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:23] (03PS1) 10Urbanecm: logos: add commons filename for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773590 [16:49:25] (03PS1) 10Urbanecm: fawiki: Set new year celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314) [16:49:28] (03PS37) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [16:51:02] jouncebot: nowandnext [16:51:02] For the next 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1600) [16:51:02] In 1 hour(s) and 8 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800) [16:51:43] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:52:29] jbond: rzl: is anything happening in puppet window? [16:52:33] or can i do a quick mw deploy? 
[16:55:26] (03PS1) 10Ladsgroup: Enable WRITE BOTH for templatelinks normalization in more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773594 (https://phabricator.wikimedia.org/T299421) [16:55:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:55:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:48] urbanecm: nope, all yours [16:55:52] thanks [16:56:19] (03CR) 10Urbanecm: [C: 03+2] logos: add commons filename for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773590 (owner: 10Urbanecm) [16:56:46] (03CR) 10Jbond: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [16:57:18] (03Merged) 10jenkins-bot: logos: add commons filename for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773590 (owner: 10Urbanecm) [16:57:49] (03CR) 10RLazarus: [C: 03+2] envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - 10https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: 10RLazarus) [16:57:57] (03PS1) 10MSantos: mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/773595 [16:58:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1146.mgmt.eqiad.wmnet with reboot policy FORCED [16:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1145.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:57] (03CR) 10Urbanecm: [C: 03+2] fawiki: Set new year celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314) (owner: 10Urbanecm) [17:00:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:00:37] 10SRE, 10Machine-Learning-Team, 10Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10elukey) 05Open→03Resolved I am yes! Thanks a lot for the support! 
[17:00:58] (03Merged) 10jenkins-bot: fawiki: Set new year celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773591 (https://phabricator.wikimedia.org/T304314) (owner: 10Urbanecm) [17:01:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:25] (03CR) 10Jbond: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [17:02:32] (03CR) 10MVernon: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [17:03:01] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:22] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [17:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:56] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:04:04] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 05d55a9: fawiki: Set new year celebration (T304314; 1/3) (duration: 00m 50s) [17:04:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) [17:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:08] 
T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314 [17:04:40] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10KFrancis) @jbond The agreement has been signed. Please proceed with the access request. Thanks! [17:05:20] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10RLazarus) 05In progress→03Resolved [17:05:26] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) [17:06:04] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 05d55a9: fawiki: Set new year celebration (T304314; 2/3) (duration: 00m 49s) [17:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:37] (03CR) 10Bking: [C: 03+2] elastic: Bring back stopping new replicas during restart [cookbooks] - 10https://gerrit.wikimedia.org/r/773562 (owner: 10Ebernhardson) [17:06:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:11] !log urbanecm@deploy1002 Synchronized logos/config.yaml: 05d55a9: fawiki: Set new year celebration (T304314; 3/3) (duration: 00m 49s) [17:07:11] (03CR) 10Bking: [C: 03+2] elastic: Remove noqa from rolling-operation.py [cookbooks] - 10https://gerrit.wikimedia.org/r/773561 (owner: 10Ebernhardson) [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:16] * urbanecm done [17:09:24] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.598e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12 [17:10:01] 
(BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:10:12] (03Merged) 10jenkins-bot: elastic: Remove noqa from rolling-operation.py [cookbooks] - 10https://gerrit.wikimedia.org/r/773561 (owner: 10Ebernhardson) [17:10:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:21] (03Merged) 10jenkins-bot: elastic: Bring back stopping new replicas during restart [cookbooks] - 10https://gerrit.wikimedia.org/r/773562 (owner: 10Ebernhardson) [17:10:38] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:10:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [17:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:30] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.599e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook 
https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [17:11:58] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10BTullis) [17:12:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [17:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:05] (03PS7) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [17:15:30] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [17:15:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) 1143, 1147 and 1148 did not respond to the provision script [17:15:46] (03CR) 10Bking: [C: 03+2] [wdqs] test jvmquake options on the public cluster [puppet] - 10https://gerrit.wikimedia.org/r/770978 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [17:18:19] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:23:23] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.1699 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:23:26] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10EChetty) [17:23:32] (03CR) 10MSantos: [C: 03+2] mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/773595 (owner: 10MSantos) [17:24:32] mmm there seems to be a big set of failures for the Exec verify-envoy-config [17:24:59] rzl: --^ [17:26:16] (03PS5) 10Zabe: Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) [17:27:21] Proto constraint validation failed (field: "upstream_protocol_options", reason: is required) [17:27:37] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:28:38] I guess it is https://gerrit.wikimedia.org/r/c/operations/puppet/+/773364 ? 
[17:29:09] (03Merged) 10jenkins-bot: mobileapps: Bunp to 2022-03-24-135848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/773595 (owner: 10MSantos) [17:29:56] (03CR) 10Ottomata: [C: 03+1] "I'm not 100% sure, but this looks correct to me, you don't need Notify" [puppet] - 10https://gerrit.wikimedia.org/r/773387 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [17:30:03] (03CR) 10Jbond: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [17:30:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:30:56] (03PS1) 10Andrew Bogott: toolsbeta: update nfs server location [puppet] - 10https://gerrit.wikimedia.org/r/773601 [17:31:55] (03PS1) 10Zabe: Use $wmgUseRestbaseVRS in comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) [17:31:58] (03PS2) 10Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) [17:32:19] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:52] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:34:14] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:34:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:34:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23035 and previous config saved to /var/cache/conftool/dbconfig/20220324-173450-marostegui.json [17:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:02] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [17:35:02] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:04] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:31] !log 
bking@cumin1001 START - Cookbook sre.wdqs.restart [17:36:31] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [17:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23036 and previous config saved to /var/cache/conftool/dbconfig/20220324-173638-root.json [17:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:57] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:53] (03CR) 10Andrew Bogott: [C: 03+2] toolsbeta: update nfs server location [puppet] - 10https://gerrit.wikimedia.org/r/773601 (owner: 10Andrew Bogott) [17:39:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:41:04] (03CR) 10Ladsgroup: "Hear me out." 
[software] - 10https://gerrit.wikimedia.org/r/773440 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [17:41:10] (03CR) 10Jbond: [C: 03+2] P:idp::client::https::site: Add check_http_expiry to idp services [puppet] - 10https://gerrit.wikimedia.org/r/773560 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [17:42:34] (03PS4) 10Ladsgroup: orchestrator: Use macros in apache config. [puppet] - 10https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [17:42:41] (03PS5) 10Ladsgroup: orchestrator: Use macros in apache config [puppet] - 10https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [17:42:41] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:43:00] (03PS6) 10Ladsgroup: orchestrator: Use macros in apache config [puppet] - 10https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [17:43:06] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] orchestrator: Use macros in apache config [puppet] - 10https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [17:44:32] !log bking@cumin1001 START - Cookbook sre.wdqs.restart [17:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:46] 10SRE, 10Data-Engineering-Radar, 10Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (10EChetty) [17:50:43] (03CR) 10Herron: [C: 03+1] "LGTM 👍" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [17:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: After schema change', diff saved to 
https://phabricator.wikimedia.org/P23037 and previous config saved to /var/cache/conftool/dbconfig/20220324-175142-root.json [17:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) [17:55:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) 05Open→03Resolved Completed [17:57:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [17:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [17:58:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [17:58:40] (03PS1) 10Zabe: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) [17:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:42] (03PS1) 10Zabe: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) [17:58:46] (03PS1) 10Zabe: Stop writing to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) [17:58:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - 
https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [17:58:58] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [17:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:59:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster [17:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster [17:59:50] (03CR) 10jerkins-bot: [V: 04-1] Stop writing to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [17:59:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [18:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: That opportune time is upon us again. Time for a 🚂🧪Trainsperiment Week Deploy deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T1800). [18:00:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [18:01:13] dancy: dduvall: jeena: brennen: so we do another extra train today? :-] [18:01:30] nope, we're done. :) [18:01:33] no sir [18:01:52] (03CR) 10Jbond: "in case you missed it this changed cause verify-envoy-config to fail, there is some additional chat in #w-serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: 10RLazarus) [18:02:23] 4 trains in 4 days 🌅🤠🐎 [18:03:18] (03CR) 10RLazarus: envoy: Move upstream HTTP config into the new HttpProtocolOptions message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773364 (https://phabricator.wikimedia.org/T303230) (owner: 10RLazarus) [18:03:37] (03PS1) 10RLazarus: Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - 10https://gerrit.wikimedia.org/r/773532 [18:05:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [18:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:54] (03PS2) 10Zabe: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) [18:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23038 and previous config saved to /var/cache/conftool/dbconfig/20220324-180646-root.json [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host 
an-worker1146.eqiad.wmnet with OS buster [18:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:00] (03PS2) 10Zabe: Stop writing to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) [18:07:24] (03CR) 10RLazarus: [C: 03+2] Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - 10https://gerrit.wikimedia.org/r/773532 (owner: 10RLazarus) [18:07:45] (03PS1) 10Jbond: P:idp::client::httpd: fix expiry check [puppet] - 10https://gerrit.wikimedia.org/r/773611 [18:07:50] (03PS2) 10RLazarus: Revert "envoy: Move upstream HTTP config into the new HttpProtocolOptions message" [puppet] - 10https://gerrit.wikimedia.org/r/773532 (https://phabricator.wikimedia.org/T303230) [18:08:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS buster [18:08:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:idp::client::httpd: fix expiry check [puppet] - 10https://gerrit.wikimedia.org/r/773611 (owner: 10Jbond) [18:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [18:08:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster [18:08:30] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10RLazarus) 
05Resolved→03In progress [18:09:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS buster [18:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Use $wmgUseRestbaseVRS in comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:10:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster [18:12:51] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 [18:14:47] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001593 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:15:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) [18:17:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[18:17:15] (03CR) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 (owner: 10Arturo Borrero Gonzalez) [18:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:35] (03PS1) 10Cmjohnson: Updating site.pp for an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) [18:17:50] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [18:17:50] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:18:10] (03CR) 10jerkins-bot: [V: 04-1] Updating site.pp for an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) (owner: 10Cmjohnson) [18:18:56] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:20:02] (03PS2) 10Cmjohnson: Updating site.pp for an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) [18:20:04] (03PS3) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) [18:20:08] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:21:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23039 and previous config saved to /var/cache/conftool/dbconfig/20220324-182150-root.json [18:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:23] (03PS1) 10Razzi: wikireplicas: remove wb_changes_dispatch view for dropped table [puppet] - 10https://gerrit.wikimedia.org/r/773614 (https://phabricator.wikimedia.org/T304591) [18:24:01] (03CR) 10Cmjohnson: [C: 03+2] Updating site.pp for an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/773613 (https://phabricator.wikimedia.org/T293922) (owner: 10Cmjohnson) [18:24:03] (03Abandoned) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [18:24:04] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:24:12] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:24:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773585 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [18:26:30] (03Restored) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [18:26:35] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster [18:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:41] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmn... [18:26:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [18:26:53] (03PS1) 10Zabe: filtered_tables.txt: remove gu_enabled and gu_enabled_method columns [puppet] - 10https://gerrit.wikimedia.org/r/773616 (https://phabricator.wikimedia.org/T303266) [18:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmn... [18:28:08] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) Raid testing. I can poll the controller for basic info: root@dumpsdata1007:~# perccli64 /c0/dall show and get BBU info: perccli64 /c0/bbu show all perccli64 /c0/d0 show I don't get how to poll fo... [18:28:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1145.eqiad.wmnet with OS buster [18:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmn... 
[18:28:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1144.eqiad.wmnet with OS buster [18:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmn... [18:31:19] (03PS4) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 [18:33:11] (03CR) 10jerkins-bot: [V: 04-1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [18:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:35:03] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1146.eqiad.wmnet with OS buster [18:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmn... 
[18:35:54] (PS5) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 [18:36:02] (CR) Jbond: R:tlsproxy: Drop version 3 support and add missing docs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond) [18:36:28] !log razzi@deneb:~$ sudo docker system prune (reclaimed 33GB) [18:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:52] (PS38) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [18:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23040 and previous config saved to /var/cache/conftool/dbconfig/20220324-183654-root.json [18:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:18] (CR) jerkins-bot: [V: -1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 (owner: Jbond) [18:37:33] (PS6) Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - https://gerrit.wikimedia.org/r/771930 [18:37:37] (CR) Ladsgroup: [C: +1] filtered_tables.txt: remove gu_enabled and gu_enabled_method columns [puppet] - https://gerrit.wikimedia.org/r/773616 (https://phabricator.wikimedia.org/T303266) (owner: Zabe) [18:38:12] (CR) Btullis: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [18:38:30] (PS7) Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - https://gerrit.wikimedia.org/r/771930 [18:44:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with
OS buster [18:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad... [18:46:08] (03PS8) 10Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - 10https://gerrit.wikimedia.org/r/771930 [18:47:18] (03PS9) 10Jbond: R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - 10https://gerrit.wikimedia.org/r/771930 [18:47:40] 10SRE: Automatically prune docker to clear disk space on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T304644 (10razzi) [18:48:48] (03PS1) 10Razzi: package_builder: run docker prune on a timer [puppet] - 10https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) [18:49:45] (03PS1) 10Andrew Bogott: toolsbeta: update nfs server location [puppet] - 10https://gerrit.wikimedia.org/r/773623 [18:49:54] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops [18:50:20] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34543/console" [puppet] - 10https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [18:50:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:44] (03CR) 10Razzi: [V: 03+1] "As discussed in #-sre" [puppet] - 10https://gerrit.wikimedia.org/r/773622 
(https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [18:52:51] (03CR) 10Andrew Bogott: [C: 03+2] toolsbeta: update nfs server location [puppet] - 10https://gerrit.wikimedia.org/r/773623 (owner: 10Andrew Bogott) [18:53:46] (03CR) 10Dzahn: [C: 03+1] "given this is like existing modules/profile/manifests/ci/docker.pp it looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [18:56:55] (03CR) 10CDanis: [C: 03+2] maps: allow bbcrewind to access maps public urls [puppet] - 10https://gerrit.wikimedia.org/r/772462 (https://phabricator.wikimedia.org/T297968) (owner: 10MSantos) [18:57:01] (03CR) 10RLazarus: [C: 03+1] "LGTM as long as PCC is still happy, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [18:59:11] (03CR) 10Razzi: [V: 03+1 C: 03+2] package_builder: run docker prune on a timer [puppet] - 10https://gerrit.wikimedia.org/r/773622 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [19:00:02] (03CR) 10Jbond: [C: 03+2] R:tlsproxy: Add missing documentation and remove some v2/v3 compat [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [19:00:32] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:01:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:10] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1142.eqiad.wmnet with OS buster [19:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install 
an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmn... [19:02:49] 10SRE, 10Patch-For-Review: Automatically prune docker to clear disk space on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T304644 (10razzi) 05Open→03Resolved a:03razzi Timers are present! ` razzi@deneb:~$ systemctl list-timers | grep docker ... Fri 2022-03-25 03:58:40 UTC 8h left n... [19:06:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34544/console" [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [19:07:10] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:13:52] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:20:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1147.eqiad.wmnet with OS buster [19:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host 
an-worker1147.eqiad.wmnet with OS buster exec... [19:21:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1148.eqiad.wmnet with OS buster [19:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster exec... [19:22:30] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23041 and previous config saved to /var/cache/conftool/dbconfig/20220324-192741-marostegui.json [19:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:48] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [19:33:36] (03CR) 10Dzahn: "I like that we avoid spamming the channel, I agree as well that "could not load file" should be a warning. 
The only concern I have that th" [puppet] - 10https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) (owner: 10Filippo Giunchedi) [19:35:27] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [19:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b... [19:42:28] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23042 and previous config saved to /var/cache/conftool/dbconfig/20220324-194246-marostegui.json [19:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:46] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:45:21] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott I think this is just a wdqs person cleaning up https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:01] 
PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:53:06] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:33] (03CR) 10Dzahn: [C: 03+1] hieradata: remove unused deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/770880 (owner: 10Majavah) [19:57:18] Welcome back dzahn! [19:57:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23043 and previous config saved to /var/cache/conftool/dbconfig/20220324-195752-marostegui.json [19:57:53] thanks dancy [19:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:08] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (10Andrew) 05Open→03Resolved [20:00:05] brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220324T2000). Please do the needful. [20:00:05] jan_drewniak, Lucas_WMDE, and zabe: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:01:29] 10SRE, 10Wikimedia-Mailing-lists: Mailman3: 550-Support for list subscription via email has been disabled. - https://phabricator.wikimedia.org/T303888 (10Ladsgroup) Yup, this is something we carried over from mailman2 given the history of abuse with mass subscription via email. Where is it being advertised? 
[20:01:50] howdy zabe [20:01:53] Hey mutante [20:01:56] wb mutante! [20:02:44] (03CR) 10Thcipriani: [C: 03+2] Use $wmgUseRestbaseVRS in comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:03:33] (03Merged) 10jenkins-bot: Use $wmgUseRestbaseVRS in comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773602 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:03:38] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [20:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls... [20:03:45] brennen: RhinosF1: :) *wave* [20:03:53] jan_drewniak: around? 
[20:04:06] How was your holiday mutante [20:04:31] tab completion is failing me for Lucas_WMDE [20:04:49] He's not here [20:05:05] @seen Lucas_WMDE [20:05:10] Left 45 min ago thcipriani [20:05:24] 19:13:47 ⇐︎ Lucas_WMDE quit (~Lucas_WMD@user/lucas-wmde/x-3192532): Quit: Lucas_WMDE [20:05:29] RhinosF1: pretty good, thank you [20:05:45] !seen Lucas_WMDE [20:05:56] I keep forgetting the right prefix but it used to work [20:06:46] @ping [20:06:54] wm-bot: hi [20:07:05] http://wm-bot.wmcloud.org/dump/%23wikimedia-operations.htm [20:07:05] @info [20:07:22] @seen mutante [20:07:27] Hmm [20:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:28] 10SRE, 10Data-Engineering-Radar, 10Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617 (10odimitrijevic) [20:07:28] Weird [20:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:36] yea, we had this feature in the past [20:07:51] not 100% sure if it was from wm-bot [20:08:16] @helo [20:08:18] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773602|Use $wmgUseRestbaseVRS in comment (T45956)]] (duration: 01m 05s) [20:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:08:23] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.1.0 [libirc v. 1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [20:08:23] @help [20:08:41] mutante: it's pm only [20:08:52] RhinosF1: aha! thanks [20:09:18] brennen: hi, sorry I'm late, I can do my deploy at the end [20:09:52] thanks jan_drewniak. 
^ cc: thcipriani [20:11:04] zabe: for 768255 > Some of these do have non-variable usage, such as in the hook for siteinfo API, as used in Puppet code. — does that mean these are *still used* in puppet code? [20:11:58] (03PS3) 10Thcipriani: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: 10Jdrewniak) [20:12:12] (03CR) 10Thcipriani: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: 10Jdrewniak) [20:12:40] thcipriani: the fields for the siteinfo API are set here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/CommonSettings.php#552 [20:12:45] I don't touch that part [20:12:57] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773380 (https://phabricator.wikimedia.org/T282012) (owner: 10Jdrewniak) [20:12:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23044 and previous config saved to /var/cache/conftool/dbconfig/20220324-201257-marostegui.json [20:12:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:13:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:02] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [20:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23045 and previous config saved to /var/cache/conftool/dbconfig/20220324-201305-marostegui.json [20:13:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] zabe: ah, ok, misread the message, thanks :) [20:15:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:45] jan_drewniak: I misread your message, too -- your change is staged on mwdebug1002, possible to check there? (I forget how portals works) [20:16:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:44] (03CR) 10Thcipriani: [C: 03+2] Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:17:58] (03CR) 10Thcipriani: Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:18:00] thcipriani: thanks! 
I'll check it now [20:18:06] (03PS6) 10Thcipriani: Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:18:17] (03CR) 10Thcipriani: [C: 03+2] Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:18:59] thcipriani: looks good to sync [20:19:03] (03Merged) 10jenkins-bot: Stop writing to certain $wmf* global variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:19:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [20:19:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [20:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:27] jan_drewniak: cool -- sync-portals is still the right magic for this? 
[20:20:40] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:21:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:23] thcipriani: yup :) [20:22:24] !log thcipriani@deploy1002 Synchronized portals/wikipedia.org/assets: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s) [20:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:29] T282012: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 [20:22:41] 10SRE, 10Observability-Logging: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10colewhite) >>! In T246998#7804127, @MoritzMuehlenhoff wrote: > Since we've replaced Kibana with Opensearch Dashboards we now actually can use OIDC or SAML it seems: Indeed! We have asked Legal to clarify if t... [20:23:16] !log thcipriani@deploy1002 Synchronized portals: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s) [20:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:24] ^ jan_drewniak all done! [20:23:38] thcipriani: thanks! 
[20:24:24] jan_drewniak: yw :) [20:24:52] zabe: your second patch is live on mwdebug (in case there's anything to specific you wanted to test aside from making sure nothing explodes :)) [20:25:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:25:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:40] thcipriani, nothing seems to explode and logstash is clear, so I would say we are good to go [20:26:55] perfect, thanks [20:28:30] !log thcipriani@deploy1002 Synchronized tests: Config: [[gerrit:768255|Stop writing to certain $wmf* global variables (T45956)]] (part I) (duration: 00m 50s) [20:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:36] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:29:58] !log thcipriani@deploy1002 Synchronized docroot/noc/db.php: Config: [[gerrit:768255|Stop writing to certain $wmf* global variables (T45956)]] (part II) (duration: 00m 51s) [20:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:10] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:30] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:768255|Stop writing to certain $wmf* 
global variables (T45956)]] (part 3) (duration: 00m 55s) [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:12] ^ zabe that's patch #2 [20:32:24] thx [20:32:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:09] (03CR) 10Thcipriani: [C: 03+2] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:33:16] (03CR) 10Thcipriani: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:33:35] fun [20:33:42] zabe: could you rebase your last one for me [20:33:52] (03CR) 10jerkins-bot: [V: 04-1] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:34:09] ah, sure [20:34:41] (03CR) 10AGueyte: [C: 03+1] Set IPInfo config for path to MaxMind files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773517 (https://phabricator.wikimedia.org/T304604) (owner: 10Tchanders) [20:35:41] (03PS2) 10Zabe: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) [20:35:52] thcipriani: hello, can you please ping me when you're 
done? I'd like to do a workaround for T304529 please. [20:35:52] T304529: scap update-interwiki-cache throws MWException: Setting $wgInterwikiCache to a CDB path is no longer supported - https://phabricator.wikimedia.org/T304529 [20:36:23] urbanecm: yep, will do [20:36:31] appreciated [20:36:36] thcipriani, rebased [20:37:02] (03CR) 10Thcipriani: [C: 03+2] Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:37:49] thanks zabe [20:38:02] (03Merged) 10jenkins-bot: Start writing to $wmgAllServices the same value as to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773607 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:40:38] zabe: live on mwdebug1002 for any checks you'd like to do [20:41:44] thcipriani, lgtm [20:41:55] thanks for checking :) [20:42:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:46] !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773607|Start writing to $wmgAllServices the same value as to $wmfAllServices (T45956)]] (duration: 01m 17s) [20:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:50] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:43:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:10] ^ zabe should be live, nice low bug number for that one :) [20:44:28] thanks for your help :) [20:44:47] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:44] thanks for the patch :) [20:46:29] (03CR) 10BBlack: [C: 03+2] map France to drmrs [dns] - 10https://gerrit.wikimedia.org/r/773245 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [20:54:53] (03PS1) 10Urbanecm: fawiki: Set celebration logo for new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) [20:54:58] thcipriani: i see all scheduled patches (but the one from Lucas) are deployed already -- just a reminder that I'd like to do some deployments too :)) [20:58:21] urbanecm: whoops, sorry, you're clear [20:58:26] thanks! [20:58:28] taking over [21:01:08] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773634 [21:01:19] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773634 (owner: 10Urbanecm) [21:02:06] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) [21:02:16] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773634 (owner: 10Urbanecm) [21:03:19] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 00m 50s) [21:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:47] (03PS2) 10Urbanecm: fawiki: Set celebration logo for new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) [21:03:51] (03CR) 10Urbanecm: [C: 03+2] fawiki: Set celebration logo for new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) (owner: 10Urbanecm) [21:04:56] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:11] (03Merged) 10jenkins-bot: fawiki: Set celebration logo for new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773633 (https://phabricator.wikimedia.org/T304314) (owner: 10Urbanecm) [21:05:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:05:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:04] !log thcipriani@deploy1002 Started deploy [releng/phatality@15f8ec0]: Deploying phatality updates for opensearch 1.2.0 [21:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:18] !log thcipriani@deploy1002 Finished deploy [releng/phatality@15f8ec0]: Deploying phatality updates for opensearch 1.2.0 (duration: 00m 13s) [21:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:27] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-fawiki-new-year.png: 43385320f417052d8e60791b3cb970e6e3f088d5: fawiki: Set celebration logo for new vector (T304314; 1/2) (duration: 00m 50s) [21:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:34] T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314 [21:09:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 43385320f417052d8e60791b3cb970e6e3f088d5: fawiki: Set celebration logo for new vector 
(T304314; 2/2) (duration: 00m 53s) [21:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:20] !log bking@cumin1001 restarting blazegraph on wdqs[1003-1013].eqiad.wmnet for T293862 [21:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:26] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [21:11:33] * urbanecm done [21:11:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:49] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [21:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b... 
[21:26:33] SRE, envoy, serviceops: Better automated validation of Puppet-generated Envoy configs - https://phabricator.wikimedia.org/T304660 (RLazarus) p:Triage→Medium [21:27:28] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:27:32] (PS1) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) [21:28:49] (PS2) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) [21:31:06] (CR) Cwhite: Add marcusolsson-json-datasource (1 comment) [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog) [21:31:59] (PS1) RLazarus: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) [21:32:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:10] (CR) Cwhite: Add marcusolsson-json-datasource (1 comment) [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/773456 (https://phabricator.wikimedia.org/T304585) (owner: Phedenskog) [21:35:24] (PS3) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) [21:35:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:36:32] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34546/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [21:36:58] (CR) jerkins-bot: [V: -1] docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [21:37:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:41] (PS4) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) [21:41:19] (CR) jerkins-bot: [V: -1] docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [21:41:24] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34547/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [21:42:01] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [21:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:06]
SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls... [21:44:16] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:45:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23047 and previous config saved to /var/cache/conftool/dbconfig/20220324-214515-marostegui.json [21:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:20] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [21:47:50] (CR) RLazarus: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: RLazarus) [21:49:59] (PS5) Razzi: docker: move pruning to new profile docker::prune [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) [21:51:06] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34548/console" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [21:54:06] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [21:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:13] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet -
https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b... [21:54:20] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:55:07] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn) [21:59:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:00:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23048 and previous config saved to /var/cache/conftool/dbconfig/20220324-220021-marostegui.json [22:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:46] (PS1) Dzahn: geoip::data::maxmind: deactivate timer for downloading of legacy DBs [puppet] - https://gerrit.wikimedia.org/r/773648 (https://phabricator.wikimedia.org/T303464) [22:06:23] (PS1) Dzahn: puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464) [22:06:50] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage [22:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:16] (CR) jerkins-bot: [V: -1] puppetmaster::geoip: stop using class
for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [22:07:25] !log restart wcqs-blazegraph on wcqs2001 to resolve intermittent BlazegraphFreeAllocatorsDecreasingRapidly [22:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:42] (CR) Razzi: [V: +1] "As you recommended @hashar I pulled the pruning into a new profile docker::prune; I'm not sure how to factor in the `if $::realm` chec" [puppet] - https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: Razzi) [22:07:49] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (cmooney) Ok so just to document we had an issue with the imaging of this, similar to the one in T303296. I had disabled option 82 in... [22:08:01] SRE, Wikimedia-Mailing-lists: Email spam from varying tawk.email addresses - https://phabricator.wikimedia.org/T304390 (Ladsgroup) If that website is a known spam source, simply add a global ban like: `.+\.tawk\.email$` [22:10:14] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage [22:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:59] I just remembered I’d signed up for the evening backport window [22:11:07] sorry about that mutante thcipriani [22:11:14] the config change can wait for Monday, it’s not the end of the world [22:12:30] jouncebot: now [22:12:30] No deployments scheduled for the next 8 hour(s) and 47 minute(s) [22:13:13] Lucas_WMDE: I just wanted to try the "seen" feature :) [22:13:18] ^^ [22:13:39] there’s also some command that would’ve sent me a message once I came back [22:13:46] I’ve never used it myself but someone did it to me not long ago ^^
[22:13:53] might be @notify, not sure [22:14:02] ah, yea [22:14:12] a rarely used feature that is actually pretty cool [22:14:28] if you still want to deploy, just ask Tyler [22:14:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host restbase2027.mgmt.codfw.wmnet with reboot policy FORCED [22:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23049 and previous config saved to /var/cache/conftool/dbconfig/20220324-221526-marostegui.json [22:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:25] (CR) Hoo man: [C: +2] Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98) [22:16:34] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn) [22:17:30] (Merged) jenkins-bot: Change foaf:homepage value from Literal to IRI [dumps/dcat] - https://gerrit.wikimedia.org/r/773490 (owner: Abbe98) [22:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:19:54] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:01] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install
cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls... [22:23:09] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Papaul) [22:24:33] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Papaul) Open→Resolved @Andrew this is complete ready for service. [22:27:33] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Papaul) @Andrew Any other issues with 1016 and 1017 ? If no can we please close this task? Thanks. [22:30:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302658)', diff saved to https://phabricator.wikimedia.org/P23050 and previous config saved to /var/cache/conftool/dbconfig/20220324-223031-marostegui.json [22:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:37] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [22:34:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:40:19] (PS1) Papaul: Add restbase2027 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/773651 (https://phabricator.wikimedia.org/T301399) [22:40:58] SRE, ops-codfw, DC-Ops, RESTBase, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 -
https://phabricator.wikimedia.org/T301399 (Papaul) [23:04:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2027.mgmt.codfw.wmnet with reboot policy FORCED [23:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:23] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (Dzahn) I could sponsor hexmode (@MarkAHershberger). [23:05:04] (CR) Papaul: [C: +2] Add restbase2027 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/773651 (https://phabricator.wikimedia.org/T301399) (owner: Papaul) [23:07:15] SRE, SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (thcipriani) >>! In T302287#7797522, @jbond wrote: >>>! In T302287#7797378, @KFrancis wrote: >> @jbond I am confirming the signed NDA. Please proceed with the access request. Thanks!... [23:08:57] (PS2) Dzahn: puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464) [23:33:59] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:35:59] (PS3) Krinkle: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [23:36:35] (CR) Krinkle: [C: +1] "Good to go. As always, stage and verify on mwdebug1002 and confirm there are no errors or exceptions happening prior to syncing."
[mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [23:39:52] (PS1) Ladsgroup: Add fix_user_varbinaries_T298565.py [software/schema-changes] - https://gerrit.wikimedia.org/r/773655 (https://phabricator.wikimedia.org/T298565) [23:43:35] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) I'm unable to get the disk to go into missing to spin down, spin back up, and set to returned to test rebuilding an array. I can set it to offline, and that's about it. Also unable to determine ho... [23:44:52] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (Dzahn) a:TomekSikora.Monsoon [23:46:30] SRE, Znuny, serviceops, Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (Dzahn) @Arnoldokoth Are you already aware of this change? [23:57:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS buster [23:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:17] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster