[00:04:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[00:04:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[00:04:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T367856)', diff saved to https://phabricator.wikimedia.org/P70359 and previous config saved to /var/cache/conftool/dbconfig/20241021-000434-ladsgroup.json
[00:04:38] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[00:09:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081704 (owner: TrainBranchBot)
[01:43:16] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T377686 (ops-monitoring-bot) NEW
[02:31:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:30] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:52] (PS2) STran: Set redirect wiki for Special:GlobalContributions [mediawiki-config] - https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612)
[06:03:01] (CR) STran: Set redirect wiki for Special:GlobalContributions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: STran)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:07:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[06:08:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance
[06:08:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: STran)
[06:11:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: STran)
[06:31:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:02] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1081220 (https://phabricator.wikimedia.org/T359820) (owner: BryanDavis)
[06:56:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host pc1013.eqiad.wmnet with OS bookworm
[06:57:09] (PS1) Arnaudb: mariadb: prepare pc1013 [puppet] - https://gerrit.wikimedia.org/r/1081806 (https://phabricator.wikimedia.org/T376387)
[06:57:09] (CR) Arnaudb: "server is being reimaged rn" [puppet] - https://gerrit.wikimedia.org/r/1081806 (https://phabricator.wikimedia.org/T376387) (owner: Arnaudb)
[06:58:06] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 153087
[06:58:07] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'email' for AS: 153087
[06:58:12] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 153087
[06:58:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 153087
[07:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T0700).
[07:00:04] kart_, dcausse, and Tran: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] o/
[07:01:05] 👋
[07:01:08] seems like I'm the only one with a non-config patch to deploy, happy to do this at end
[07:01:14] (PS1) Muehlenhoff: Remove Cumin aliases for legacy mail servers [puppet] - https://gerrit.wikimedia.org/r/1081807
[07:01:54] here
[07:02:08] I'll go with my patch..
[07:03:30] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080857 (owner: KartikMistry)
[07:04:18] (Merged) jenkins-bot: Enable Special:Contribute on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1080857 (owner: KartikMistry)
[07:05:28] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1080857|Enable Special:Contribute on bnwiki]]
[07:05:29] !log kartik@deploy2002 scap failed: Command '['/usr/bin/scap', 'mwshell', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.27/cache/l10n/*.tmp.*']' returned non-zero exit status 126. (scap version: 4.113.0) (duration: 00m 01s)
[07:06:22] Seems scap is failing!
[07:06:34] Anyone has idea about these errors?
[07:08:00] kart_: never seen this one before... :/
[07:09:03] Let me try again..
[07:09:25] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1080857|Enable Special:Contribute on bnwiki]]
[07:09:26] !log kartik@deploy2002 scap failed: Command '['/usr/bin/scap', 'mwshell', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.27/cache/l10n/*.tmp.*']' returned non-zero exit status 126. (scap version: 4.113.0) (duration: 00m 01s)
[07:09:34] ah. same error.
[07:10:09] We have to abort the deployment until someone has idea about these. hashar ^^
[07:10:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1013.eqiad.wmnet with reason: host reimage
[07:13:17] 126 could be a perm issue? perhaps https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081281 is related?
[07:13:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1013.eqiad.wmnet with reason: host reimage
[07:14:16] (PS1) Muehlenhoff: Switch parsoid::testing to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1081810 (https://phabricator.wikimedia.org/T349619)
[07:16:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7001.wikimedia.org
[07:16:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:17:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:18:05] kart_: will file an issue, seems like a UBN to me
[07:19:55] yes!
[07:22:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7001.wikimedia.org
[07:23:32] !log installing python-reportlab security updates
[07:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:12] (PS1) JMeybohm: KubernetesContainerReachingMemoryLimit: mcrouter constantly alerting [alerts] - https://gerrit.wikimedia.org/r/1081899
[07:25:42] (CR) Muehlenhoff: [C:+2] Switch parsoid::testing to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1081810 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:26:22] (CR) Slyngshede: "Consider adding something like:" [puppet] - https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto)
[07:26:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:26:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:26:49] (PS1) Máté Szabó: webperf: install php-mbstring [puppet] - https://gerrit.wikimedia.org/r/1081900 (https://phabricator.wikimedia.org/T377433)
[07:27:03] (PS4) Elukey: sre.hosts.provision: first refactor with vendor-specific classes [cookbooks] - https://gerrit.wikimedia.org/r/1080456 (https://phabricator.wikimedia.org/T365372)
[07:27:03] (PS1) Elukey: sre.hosts.provision: raise RuntimeError if Redfish returns an error [cookbooks] - https://gerrit.wikimedia.org/r/1081901 (https://phabricator.wikimedia.org/T365372)
[07:27:34] Tran: we have to cancel the backport window... I filed T377692 for this
[07:27:35] T377692: scap fails with non-zero exit status 126 when running mwshell - https://phabricator.wikimedia.org/T377692
[07:29:03] Thanks for looking into it and for the update 🙇
[07:29:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:29:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:29:46] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10245155 (MoritzMuehlenhoff)
[07:31:22] (PS2) Elukey: sre.hosts.provision: raise RuntimeError if Redfish returns an error [cookbooks] - https://gerrit.wikimedia.org/r/1081901 (https://phabricator.wikimedia.org/T365372)
[07:33:24] (CR) CI reject: [V:-1] sre.hosts.provision: raise RuntimeError if Redfish returns an error [cookbooks] - https://gerrit.wikimedia.org/r/1081901 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[07:36:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1013.eqiad.wmnet with OS bookworm
[07:40:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:40:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:50:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:50:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:51:43] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1081220 (https://phabricator.wikimedia.org/T359820) (owner: BryanDavis)
[07:52:25] (PS1) Brouberol: ceph-csi-cephs: fix RBAC by granting cluster-wide permisions on PVC and storageclasses [deployment-charts] - https://gerrit.wikimedia.org/r/1081903 (https://phabricator.wikimedia.org/T376406)
[07:53:59] (CR) Máté Szabó: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1081900 (https://phabricator.wikimedia.org/T377433) (owner: Máté Szabó)
[08:02:32] (PS1) Muehlenhoff: Apply the Ganeti role to ganeti2037/ganeti2038 [puppet] - https://gerrit.wikimedia.org/r/1081904 (https://phabricator.wikimedia.org/T376594)
[08:04:38] (PS1) Brouberol: ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc [puppet] - https://gerrit.wikimedia.org/r/1081905 (https://phabricator.wikimedia.org/T376406)
[08:05:12] (CR) CI reject: [V:-1] ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc [puppet] - https://gerrit.wikimedia.org/r/1081905 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:06:10] (PS2) Brouberol: ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc [puppet] - https://gerrit.wikimedia.org/r/1081905 (https://phabricator.wikimedia.org/T376406)
[08:06:22] (CR) Muehlenhoff: [C:+2] Apply the Ganeti role to ganeti2037/ganeti2038 [puppet] - https://gerrit.wikimedia.org/r/1081904 (https://phabricator.wikimedia.org/T376594) (owner: Muehlenhoff)
[08:07:42] (PS3) Volans: Fix issues reported by pylint >3 [software/spicerack] - https://gerrit.wikimedia.org/r/1078663
[08:07:42] (PS1) Volans: apiclient: add a generic API client module [software/spicerack] - https://gerrit.wikimedia.org/r/1081906
[08:07:42] (PS1) Volans: orchestrator: add a new module [software/spicerack] - https://gerrit.wikimedia.org/r/1081908
[08:09:06] !log jayme@cumin1002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-eqiad: containerd migration
[08:09:41] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1003.eqiad.wmnet with OS bookworm
[08:18:22] (CR) Btullis: [C:+1] "Looks good, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1081903 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:20:02] (CR) Brouberol: [C:+2] ceph-csi-cephs: fix RBAC by granting cluster-wide permisions on PVC and storageclasses [deployment-charts] - https://gerrit.wikimedia.org/r/1081903 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:21:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:22:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:22:01] (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1081905 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:23:00] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage
[08:23:07] (PS1) JMeybohm: Migrate wikikube-worker208[5689] to containerd [puppet] - https://gerrit.wikimedia.org/r/1081910 (https://phabricator.wikimedia.org/T362408)
[08:23:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:23:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:24:18] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1081901 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[08:24:34] (CR) Volans: [C:+2] Fix issues reported by pylint >3 [software/spicerack] - https://gerrit.wikimedia.org/r/1078663 (owner: Volans)
[08:24:51] (CR) Brouberol: [C:+2] ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc [puppet] - https://gerrit.wikimedia.org/r/1081905 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:26:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage
[08:26:40] (CR) Arnaudb: [C:+1] "this will be handy to gather infos from orchestrator" [software/spicerack] - https://gerrit.wikimedia.org/r/1081908 (owner: Volans)
[08:33:14] (PS1) Brouberol: ceph/server: fix typo in caps [puppet] - https://gerrit.wikimedia.org/r/1081911 (https://phabricator.wikimedia.org/T376406)
[08:33:58] (CR) Btullis: [C:+1] ceph/server: fix typo in caps [puppet] - https://gerrit.wikimedia.org/r/1081911 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:34:32] (Merged) jenkins-bot: Fix issues reported by pylint >3 [software/spicerack] - https://gerrit.wikimedia.org/r/1078663 (owner: Volans)
[08:35:11] (CR) Brouberol: [C:+2] ceph/server: fix typo in caps [puppet] - https://gerrit.wikimedia.org/r/1081911 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:35:12] (CR) Volans: [C:+1] "LGTM, thx" [puppet] - https://gerrit.wikimedia.org/r/1081807 (owner: Muehlenhoff)
[08:41:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet
[08:41:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet
[08:41:30] (CR) Volans: [C:+1] "LGTM, I tried to follow all the code moves, I can't be sure 100% they are all good but I didn't spot anything wrong and the tests confirms" [cookbooks] - https://gerrit.wikimedia.org/r/1080456 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[08:42:36] (CR) Muehlenhoff: [C:+2] Remove Cumin aliases for legacy mail servers [puppet] - https://gerrit.wikimedia.org/r/1081807 (owner: Muehlenhoff)
[08:44:54] !log jnuche@deploy2002 Installing scap version "4.114.0" for 210 hosts
[08:46:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet
[08:46:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet
[08:47:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1003.eqiad.wmnet with OS bookworm
[08:48:32] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1004.eqiad.wmnet with OS bookworm
[08:50:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet
[08:50:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet
[08:52:01] dcausse, kart_: I've deployed a new Scap version, hopefully the issue will be gone now
[08:52:13] are you still around to try the backport again?
[08:52:16] jnuche: thanks!
[08:52:53] jnuche: yes I'm around to test if nobody else wants to
[08:53:21] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@d176c47]: (no justification provided)
[08:53:24] mine is a backport to an extension so might take time to pass CI
[08:53:31] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@d176c47]: (no justification provided) (duration: 00m 11s)
[08:54:54] dcausse: I think it shouldn't be a problem, there's nothing scheduled for the next hour
[08:55:08] jnuche: ok, will deploy
[08:55:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet
[08:55:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet
[08:56:32] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081402 (owner: DCausse)
[08:56:32] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715) (owner: DCausse)
[08:57:57] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet
[08:58:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet
[08:59:46] (CR) Elukey: [C:+1] apiclient: add a generic API client module (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1081906 (owner: Volans)
[09:00:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[09:02:48] (CR) Elukey: [C:+1] "LGTM, I left a comment for APIClientResponseError but if it is not a concern feel free to proceed :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1081907 (owner: Volans)
[09:02:48] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage
[09:03:38] (CR) Elukey: [C:+2] sre.hosts.provision: first refactor with vendor-specific classes [cookbooks] - https://gerrit.wikimedia.org/r/1080456 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:03:45] (CR) Elukey: [C:+2] sre.hosts.provision: raise RuntimeError if Redfish returns an error [cookbooks] - https://gerrit.wikimedia.org/r/1081901 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:03:50] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet
[09:04:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[09:04:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet
[09:05:15] SRE, Infrastructure-Foundations, netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10245434 (Aklapper) @cmooney: Could you please answer the last comment? Thanks in advance! :)
[09:06:04] (Abandoned) Hashar: systemd::timer::job: relax send_mail_from parameter [puppet] - https://gerrit.wikimedia.org/r/1076910 (owner: Hashar)
[09:06:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage
[09:07:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
[09:08:16] (CR) Elukey: [C:+1] "LGTM, left another nit related to APIClient exception handling, the rest looks good!" [software/spicerack] - https://gerrit.wikimedia.org/r/1081908 (owner: Volans)
[09:09:05] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:09:15] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:10:22] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:10:31] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:11:04] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:11:13] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:11:30] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet
[09:12:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
[09:12:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet
[09:14:47] SRE, Infrastructure-Foundations, netops: ToR server-move Netbox script adding ".0" to end of interface names - https://phabricator.wikimedia.org/T374024#10245492 (cmooney) Open→Resolved Apologies, yes all this is complete, resolving.
[09:16:04] (Merged) jenkins-bot: Fix phan issue with getCounter returning NullMetric|CounterMetric [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081402 (owner: DCausse)
[09:16:05] (Merged) jenkins-bot: Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715) (owner: DCausse)
[09:16:23] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1081402|Fix phan issue with getCounter returning NullMetric|CounterMetric]], [[gerrit:1081396|Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights (T376715)]]
[09:16:27] T376715: TypeError: Argument 3 passed to CirrusSearch\DataSender::sendWeightedTagsUpdate() must be of the type array, null given, called in /srv/mediawiki/php-1.43.0-wmf.25/extensions/CirrusSearch/includes/Job/ElasticaWrite.php on line - https://phabricator.wikimedia.org/T376715
[09:18:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet
[09:18:10] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet
[09:19:39] (CR) Hnowlan: [C:+1] Migrate wikikube-worker208[5689] to containerd [puppet] - https://gerrit.wikimedia.org/r/1081910 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[09:19:53] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1009.eqiad.wmnet
[09:22:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
[09:24:15] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1009.eqiad.wmnet
[09:26:45] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1010.eqiad.wmnet
[09:27:08] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1081402|Fix phan issue with getCounter returning NullMetric|CounterMetric]], [[gerrit:1081396|Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights (T376715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:27:12] T376715: TypeError: Argument 3 passed to CirrusSearch\DataSender::sendWeightedTagsUpdate() must be of the type array, null given, called in /srv/mediawiki/php-1.43.0-wmf.25/extensions/CirrusSearch/includes/Job/ElasticaWrite.php on line - https://phabricator.wikimedia.org/T376715
[09:27:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1004.eqiad.wmnet with OS bookworm
[09:27:56] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster1005.eqiad.wmnet with OS bookworm
[09:28:35] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10245537 (elukey) >>! In T371400#10243686, @Jhancock.wm wrote: > > @elukey ran into another issue with the provisioning script. It happened before on a...
[09:29:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
[09:29:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
[09:29:37] !log dcausse@deploy2002 dcausse: Continuing with sync
[09:31:08] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1010.eqiad.wmnet
[09:31:46] SRE-Sprint-Week-Sustainability-March2023, conftool, Traffic, Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009#10245561 (Joe)
[09:32:30] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1011.eqiad.wmnet
[09:33:09] (CR) Slyngshede: [C:+2] bitu: Add some stewards to the list of account managers [puppet] - https://gerrit.wikimedia.org/r/1081220 (https://phabricator.wikimedia.org/T359820) (owner: BryanDavis)
[09:33:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
[09:33:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet
[09:33:30] (CR) Clément Goubert: "We could maybe use the `semver` module and do" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[09:36:45] (PS39) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129)
[09:36:53] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1011.eqiad.wmnet
[09:39:50] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081402|Fix phan issue with getCounter returning NullMetric|CounterMetric]], [[gerrit:1081396|Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights (T376715)]] (duration: 23m 26s)
[09:39:54] T376715: TypeError: Argument 3 passed to CirrusSearch\DataSender::sendWeightedTagsUpdate() must be of the type array, null given, called in /srv/mediawiki/php-1.43.0-wmf.25/extensions/CirrusSearch/includes/Job/ElasticaWrite.php on line - https://phabricator.wikimedia.org/T376715
[09:40:20] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[09:40:32] jnuche: worked well from my side, thanks for the fix!
[09:40:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet
[09:40:54] (CR) Elukey: [C:-1] "This is a very nice trick! Tried to apply it, but the following test doesn't work:" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[09:41:42] dcausse: awesome! glad it worked :)
[09:42:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage
[09:45:44] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[09:46:16] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage
[09:47:05] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[09:47:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2037.codfw.wmnet to cluster codfw and group C
[09:49:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2037.codfw.wmnet to cluster codfw and group C
[09:52:41] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[09:53:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:53:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:56:00] (PS1) Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033)
[09:57:13] (PS2) Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033)
[10:00:07] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1000)
[10:01:36] (PS3) Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033)
[10:02:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephmon1006.eqiad.wmnet
[10:03:20] (PS1) Muehlenhoff: Switch cloudcephmon1006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1081925 (https://phabricator.wikimedia.org/T349619)
[10:04:29] (CR) Muehlenhoff: [C:+2] Switch cloudcephmon1006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1081925 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:07:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:07:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:08:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1005.eqiad.wmnet with OS bookworm
[10:08:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster staging-eqiad: containerd migration
[10:10:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephmon1006.eqiad.wmnet
[10:11:30] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: ProbeDown (instance centrallog2002:6514) - https://phabricator.wikimedia.org/T377703 (LSobanski) NEW
[10:12:06] (PS21) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:12:16] (CR) Jcrespo: [C:+2] mariadb: Default pt-heartbeat to STATEMENT-based replication [puppet] - https://gerrit.wikimedia.org/r/1081103 (https://phabricator.wikimedia.org/T375144) (owner: Jcrespo)
[10:14:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:14:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1245.eqiad.wmnet with reason: testing depool/repool
[10:14:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:14:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1245.eqiad.wmnet with reason: testing depool/repool
[10:15:56] (PS1) Jelto: static-codereview: bump image to 2024-10-21-095944 [deployment-charts] - https://gerrit.wikimedia.org/r/1081926 (https://phabricator.wikimedia.org/T363771)
[10:16:05] (PS1) Jcrespo: Revert "mariadb: Default pt-heartbeat to STATEMENT-based replication" [puppet] - https://gerrit.wikimedia.org/r/1081927
[10:16:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1213.eqiad.wmnet with reason: testing depool/repool
[10:16:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1213.eqiad.wmnet with reason: testing depool/repool
[10:17:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:17:36] (CR) Jelto: "I tested this locally and the newest image should contain the correct index.html now:" [deployment-charts] - https://gerrit.wikimedia.org/r/1081926 (https://phabricator.wikimedia.org/T363771) (owner: Jelto)
[10:18:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1185.eqiad.wmnet with reason: testing depool/repool [10:18:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1185.eqiad.wmnet with reason: testing depool/repool [10:21:56] (03PS2) 10Mhorsey: Release CampaignEvents to eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078907 (https://phabricator.wikimedia.org/T376786) [10:23:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet [10:25:22] (03CR) 10Clément Goubert: [C:03+2] kubestage: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171) (owner: 10Clément Goubert) [10:25:49] (03CR) 10Clément Goubert: [C:03+2] kubernetes: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) (owner: 10Clément Goubert) [10:25:54] (03CR) 10Clément Goubert: [C:03+2] kubernetes: eqiad expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: 10Clément Goubert) [10:25:58] (03CR) 10Clément Goubert: [C:03+2] kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) (owner: 10Clément Goubert) [10:26:00] (03CR) 10Clément Goubert: [C:03+2] kubernetes: codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) (owner: 10Clément Goubert) [10:27:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet [10:31:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2038.codfw.wmnet to cluster codfw and group C [10:31:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2038.codfw.wmnet to cluster codfw and group C [10:32:49] (03CR) 10Elukey: Add a cookbook to roll-reimage stacked k8s control planes (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:33:16] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation, 13Patch-For-Review: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10245764 (10Jelto) a:03Jelto [10:36:05] (03PS1) 10Vgutierrez: profile: Remove digicert 2023 [puppet] - 10https://gerrit.wikimedia.org/r/1081928 [10:38:02] (03PS2) 10Volans: apiclient: add a generic API client module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081906 [10:38:02] (03PS2) 10Volans: redfish: use the new apiclient module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081907 [10:38:02] (03PS2) 10Volans: orchestrator: add a new module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081908 [10:38:18] (03CR) 10Volans: "addressed comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081906 (owner: 10Volans) [10:38:53] (03CR) 10Volans: "reply/question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081907 (owner: 10Volans) [10:39:09] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4323/co" [puppet] - 10https://gerrit.wikimedia.org/r/1081928 (owner: 10Vgutierrez) [10:39:21] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081908 (owner: 10Volans) [10:39:50] (03CR) 10Vgutierrez: [V:03+1 C:03+2] profile: Remove 
digicert 2023 [puppet] - 10https://gerrit.wikimedia.org/r/1081928 (owner: 10Vgutierrez) [10:41:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [10:41:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [10:41:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: 10STran) [10:42:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 10Kosta Harlan) [10:43:53] (03CR) 10Ladsgroup: [C:03+1] "that's noop, so it should be fine. You don't need to make it pc3 master though. It can stay as a hot spare?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1081806 (https://phabricator.wikimedia.org/T376387) (owner: 10Arnaudb) [10:47:14] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1029.eqiad.wmnet [10:48:32] (03PS1) 10Muehlenhoff: Switch cloudcephosd1029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1081930 (https://phabricator.wikimedia.org/T349619) [10:48:39] (03PS1) 10Btullis: Move some HDFS tests from data-platform to data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) [10:50:12] (03CR) 10CI reject: [V:04-1] Move some HDFS tests from data-platform to data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) (owner: 10Btullis) [10:50:46] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1029 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1081930 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:51:29] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:51:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:51:52] (03CR) 10Elukey: redfish: use the new apiclient module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081907 (owner: 10Volans) [10:52:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:52:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:52:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: 
Maintenance [10:52:20] (03CR) 10Elukey: [C:03+1] apiclient: add a generic API client module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081906 (owner: 10Volans) [10:52:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T376905)', diff saved to https://phabricator.wikimedia.org/P70360 and previous config saved to /var/cache/conftool/dbconfig/20241021-105223-ladsgroup.json [10:54:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1029.eqiad.wmnet [10:57:22] (03PS2) 10Btullis: Move some HDFS tests from data-platform to data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) [10:59:40] !log installing curl security updates [10:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T376905)', diff saved to https://phabricator.wikimedia.org/P70361 and previous config saved to /var/cache/conftool/dbconfig/20241021-110136-ladsgroup.json [11:01:53] (03PS1) 10Btullis: Datahub: Increase the RAM for the datahub restore-incides job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081935 (https://phabricator.wikimedia.org/T376657) [11:02:38] (03PS2) 10Btullis: Datahub: Increase the RAM for the datahub restore-incides job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081935 (https://phabricator.wikimedia.org/T376657) [11:02:52] (03CR) 10Alexandros Kosiaris: "I 'd ask how critical that test is and subsequently how much "correct" we want to be. 
I don't think we ever set a requirement that all ima" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 (owner: 10Elukey) [11:03:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr3-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:04:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:29] !incidents [11:04:30] 5333 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-eqsin.wikimedia.org) [11:04:30] 5332 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [11:04:35] !ack 5333 [11:04:35] 5333 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-eqsin.wikimedia.org) [11:04:44] (03CR) 10Btullis: [C:03+2] Datahub: Increase the RAM for the datahub restore-incides job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081935 (https://phabricator.wikimedia.org/T376657) (owner: 10Btullis) [11:04:57] arnaudb, tappof ^^ :) [11:05:56] (03Merged) 10jenkins-bot: Datahub: Increase the RAM for the datahub restore-incides job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081935 (https://phabricator.wikimedia.org/T376657) (owner: 10Btullis) [11:07:59] (03PS3) 10Volans: redfish: use the new apiclient module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081907 [11:08:00] (03PS3) 10Volans: orchestrator: add a new module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081908 [11:08:11] (03CR) 10Volans: orchestrator: add a new module (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1081908 (owner: 10Volans) [11:08:22] (03CR) 10Volans: "addressed comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1081907 (owner: 10Volans) [11:09:09] FIRING: [7x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:29] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:09] RESOLVED: [7x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:14:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [11:16:29] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:16:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:16:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P70362 and previous config saved to /var/cache/conftool/dbconfig/20241021-111643-ladsgroup.json [11:17:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [11:17:29] (03PS7) 
10Clément Goubert: sre.discovery.datacenter: Add failover_from action [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) [11:17:42] (03CR) 10Clément Goubert: sre.discovery.datacenter: Add failover_from action (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [11:21:29] FIRING: [6x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10245885 (10MoritzMuehlenhoff) [11:26:55] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10245888 (10MoritzMuehlenhoff) [11:27:11] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10245889 (10MoritzMuehlenhoff) [11:31:43] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:31:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P70363 and previous config saved to /var/cache/conftool/dbconfig/20241021-113150-ladsgroup.json [11:33:57] (03PS2) 10Clément Goubert: mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) [11:34:04] (03PS2) 10Clément Goubert: mc-gp: codfw refresh [puppet] - 
10https://gerrit.wikimedia.org/r/1079941 (https://phabricator.wikimedia.org/T376968) [11:35:56] (03CR) 10Effie Mouzeli: [C:03+2] mc-gp: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079941 (https://phabricator.wikimedia.org/T376968) (owner: 10Clément Goubert) [11:38:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr3-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:39:59] (03PS3) 10Clément Goubert: mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) [11:40:03] (03CR) 10Effie Mouzeli: [C:03+2] mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) (owner: 10Clément Goubert) [11:40:37] !log installing python-idna security updates [11:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:21] (03PS4) 10Effie Mouzeli: mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) (owner: 10Clément Goubert) [11:45:02] (03Abandoned) 10Effie Mouzeli: WIP: Add mc2038-mc2055 [puppet] - 10https://gerrit.wikimedia.org/r/791583 (https://phabricator.wikimedia.org/T293012) (owner: 10Alexandros Kosiaris) [11:46:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T376905)', diff saved to https://phabricator.wikimedia.org/P70364 and previous config saved to /var/cache/conftool/dbconfig/20241021-114657-ladsgroup.json [11:47:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:47:16] (03CR) 10Effie Mouzeli: mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) (owner: 10Clément Goubert) [11:47:17] !log ladsgroup@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:47:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70365 and previous config saved to /var/cache/conftool/dbconfig/20241021-114723-ladsgroup.json [11:50:48] (03CR) 10Effie Mouzeli: [C:03+1] KubernetesContainerReachingMemoryLimit: mcrouter constantly alerting [alerts] - 10https://gerrit.wikimedia.org/r/1081899 (owner: 10JMeybohm) [11:51:23] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:51:31] (03CR) 10Effie Mouzeli: [C:03+2] mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) (owner: 10Clément Goubert) [11:52:05] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:52:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:52:53] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:52:54] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10245933 (10MoritzMuehlenhoff) [11:53:04] (03PS2) 10Hnowlan: thumbor: add mcrouter config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078386 [11:54:43] (03CR) 10Effie Mouzeli: [C:03+1] thumbor: add mcrouter config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078386 (owner: 10Hnowlan) [11:54:54] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) (owner: 10Clément Goubert) [11:56:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70366 and previous config saved to 
/var/cache/conftool/dbconfig/20241021-115629-ladsgroup.json [11:56:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: sync on production [12:00:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [12:01:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:01:24] (03CR) 10EoghanGaffney: [C:03+1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:01:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:09:02] !log klausman@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [12:09:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10245987 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host... 
[12:11:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P70367 and previous config saved to /var/cache/conftool/dbconfig/20241021-121136-ladsgroup.json [12:19:47] (03PS1) 10Btullis: Datahub: disable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081948 (https://phabricator.wikimedia.org/T376657) [12:20:55] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081950 [12:21:40] !log klausman@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [12:24:07] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [12:24:39] (03CR) 10JMeybohm: [C:03+2] KubernetesContainerReachingMemoryLimit: mcrouter constantly alerting [alerts] - 10https://gerrit.wikimedia.org/r/1081899 (owner: 10JMeybohm) [12:26:29] (03Merged) 10jenkins-bot: KubernetesContainerReachingMemoryLimit: mcrouter constantly alerting [alerts] - 10https://gerrit.wikimedia.org/r/1081899 (owner: 10JMeybohm) [12:26:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P70368 and previous config saved to /var/cache/conftool/dbconfig/20241021-122644-ladsgroup.json [12:30:25] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add ATS map for lists.wikimedia.org -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1072247 (owner: 10EoghanGaffney) [12:33:00] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:35:25] (03PS1) 10Arnaudb: mariadb: add db2240 as a replica [puppet] - 10https://gerrit.wikimedia.org/r/1081963 (https://phabricator.wikimedia.org/T373579) 
[12:35:43] (03CR) 10Muehlenhoff: [C:03+2] tlsproxy::envoy: Simplify firewall rule set [puppet] - 10https://gerrit.wikimedia.org/r/1079395 (owner: 10Muehlenhoff) [12:36:36] (03PS22) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [12:36:51] (03CR) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:38:31] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:40:51] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10246081 (10aborrero) [12:41:51] (03PS23) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [12:41:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70369 and previous config saved to /var/cache/conftool/dbconfig/20241021-124151-ladsgroup.json [12:41:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:42:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:42:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T376905)', diff saved to https://phabricator.wikimedia.org/P70370 and previous config saved to /var/cache/conftool/dbconfig/20241021-124217-ladsgroup.json [12:42:41] (03CR) 10Arnaudb: 
[C:03+2] mariadb: add db2240 as a replica [puppet] - 10https://gerrit.wikimedia.org/r/1081963 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [12:44:29] (03PS24) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [12:44:43] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10246085 (10aborrero) 05Open→03In progress Let me check what is left to be done here. [12:45:35] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-lab1002.eqiad.wmnet with OS bookworm [12:46:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10246093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-l... [12:47:47] (03CR) 10Muehlenhoff: "All great minds think alike; I had a similar patch pending, which I have now merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/" [puppet] - 10https://gerrit.wikimedia.org/r/1079542 (https://phabricator.wikimedia.org/T327259) (owner: 10Brouberol) [12:48:32] (03CR) 10Jcrespo: [C:04-2] "Not needed, deployment went well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1081927 (owner: 10Jcrespo) [12:48:46] (03Abandoned) 10Jcrespo: Revert "mariadb: Default pt-heartbeat to STATEMENT-based replication" [puppet] - 10https://gerrit.wikimedia.org/r/1081927 (owner: 10Jcrespo) [12:50:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T376905)', diff saved to https://phabricator.wikimedia.org/P70371 and previous config saved to /var/cache/conftool/dbconfig/20241021-125029-ladsgroup.json [12:52:25] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10246114 (10jcrespo) Deployment went well, I will update the incident doc with the long-term fix and the... [12:53:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet [12:54:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [12:58:30] (03PS1) 10Arturo Borrero Gonzalez: openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) [12:58:51] (03PS2) 10Arturo Borrero Gonzalez: openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) [12:59:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1300). nyaa~ [13:00:05] Daimona and Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] nyaa~ [13:00:19] o/ [13:00:23] (03PS1) 10EoghanGaffney: lists: Fix ATS backend map target for lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1081969 [13:00:25] o/ [13:00:26] 👋 [13:00:26] (03PS10) 10Ayounsi: WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [13:01:25] I can deploy! [13:02:14] (03PS3) 10Arturo Borrero Gonzalez: openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) [13:02:26] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [13:03:23] TIL there’s Extension:WikimediaCampaignEvents [13:03:33] Yeah... [13:03:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081446 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:04:16] (03CR) 10CI reject: [V:04-1] WIP: first scaffolding for JSON-RPC support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [13:05:01] (03Merged) 10jenkins-bot: Enable CampaignEvents collaboration list in testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081446 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:05:15] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1081446|Enable CampaignEvents collaboration list in testwiki and test2wiki (T376055)]] [13:05:34] T376055: Release Collaboration List MVP to testwiki and test2wiki - https://phabricator.wikimedia.org/T376055 [13:05:39] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P70372 and previous config saved to /var/cache/conftool/dbconfig/20241021-130538-ladsgroup.json [13:06:13] @Daimona don't say it like that XD [13:06:40] I didn't say anything, I just typed it :P [13:07:44] (03PS4) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [13:08:14] hm, one of the testserver checks failed [13:08:15] retrying [13:08:34] the favicon.ico returned text/html (500) [13:08:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1081446|Enable CampaignEvents collaboration list in testwiki and test2wiki (T376055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:40] now it worked [13:08:49] Daimona: can you test the change on mwdebug? [13:10:51] (03PS5) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [13:11:15] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test Ide32aae22b53884a76ab10ee4504052cbe777735 with dummy upgrade [13:11:50] It seems successfully broken for the time being, let me check the logs [13:11:54] (03CR) 10Effie Mouzeli: [C:03+1] openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [13:12:00] (03CR) 10Xcollazo: [C:03+1] "Can't comment on the YAML files specifically, but +1 in principle as I think DE/DP folks would be in a better position to respond to these" [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) (owner: 10Btullis) [13:12:02] hm, is that good or not? 
^^ [13:12:35] as in, the feature is turned on and doesn’t work properly? [13:12:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: test Ide32aae22b53884a76ab10ee4504052cbe777735 with dummy upgrade [13:13:37] The feature is turned on but requires data from other places that may or may not be available [13:13:44] 10ops-codfw, 06SRE, 06DC-Ops: lsw-d[18]-codfw missing console port info in netbox - https://phabricator.wikimedia.org/T376917#10246240 (10Papaul) @RobH thank you for opening this task. We need to run 2 new runs from scs1-c1-codfw to those switches and right now we don't have the orange cable to do so. when w... [13:14:37] It may be a case of "Request failed, successfully" [13:14:51] It's there but not working correctly. In the logs I found: Failed to connect to query.wikidata.org port 443: Connection timed out [13:15:03] (03Abandoned) 10Arnaudb: mariadb: prepare pc1013 [puppet] - 10https://gerrit.wikimedia.org/r/1081806 (https://phabricator.wikimedia.org/T376387) (owner: 10Arnaudb) [13:15:29] hm [13:15:44] AFAIK that’s not the right hostname to connect to WDQS internally [13:15:55] or at least it’s not the one we use in WikibaseQualityConstraints [13:16:06] but the search team would know better [13:16:22] (03CR) 10Jelto: [V:03+2 C:03+2] "Cookbook passed with a dummy upgrade on one of the replicas." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [13:16:58] (03CR) 10Jelto: [C:03+2] static-codereview: bump image to 2024-10-21-095944 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081926 (https://phabricator.wikimedia.org/T363771) (owner: 10Jelto) [13:17:37] ($wgWBQualityConstraintsSparqlEndpoint is currently http://localhost:6009/sparql, though that’s subject to change with T374021) [13:17:37] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [13:17:50] Oh that's interesting. [13:17:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2172 to clone on db2240 T373579', diff saved to https://phabricator.wikimedia.org/P70373 and previous config saved to /var/cache/conftool/dbconfig/20241021-131750-arnaudb.json [13:18:05] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [13:18:06] I was just wondering why it would work locally but not in prod. [13:18:09] (03Merged) 10jenkins-bot: static-codereview: bump image to 2024-10-21-095944 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081926 (https://phabricator.wikimedia.org/T363771) (owner: 10Jelto) [13:18:50] probably needs a revert then? and a patch to make https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaCampaignEvents/+/8abe6417228aea6543f37ac6a5ad323f6f5538fe/src/WikiProject/WikiProjectIDLookup.php#97 configurable? [13:19:22] (03CR) 10Btullis: [C:03+2] Move some HDFS tests from data-platform to data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) (owner: 10Btullis) [13:19:34] Looks like it :( [13:19:52] Daimona: there's a SPARQLClient in mediawiki you should perhaps use this? 
[13:19:58] (03PS6) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [13:20:34] (03Merged) 10jenkins-bot: Move some HDFS tests from data-platform to data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1081931 (https://phabricator.wikimedia.org/T376713) (owner: 10Btullis) [13:20:43] hm, I wonder if there’s a reason why we don’t seem to be using that one in WikibaseQualityConstraints [13:20:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P70374 and previous config saved to /var/cache/conftool/dbconfig/20241021-132045-ladsgroup.json [13:20:54] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/45777c2ec060c61f0650780d7337e0b218cfb273/includes/sparql [13:22:07] dcausse: Thanks for the pointer! Didn't know about that (despite having seen the "sparql" folder multiple times the last few days for unrelated reasons). I knew about SparqlHelper in WBQC only. 
[13:22:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: provisionning db2240.codfw.wmnet - T373579 [13:22:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: provisionning db2240.codfw.wmnet - T373579 [13:22:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: provisionning db2240.codfw.wmnet - T373579 [13:22:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: provisionning db2240.codfw.wmnet - T373579 [13:23:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2172 in db2240 for T373579', diff saved to https://phabricator.wikimedia.org/P70375 and previous config saved to /var/cache/conftool/dbconfig/20241021-132351-arnaudb.json [13:24:00] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [13:24:05] Then I guess let's revert, migrate code to SparqlClient, make any necessary adjustments to the hostname, and deploy later this week. @HouseOfM how does that sound? [13:24:13] !log lucaswerkmeister-wmde@deploy2002 Sync cancelled. 
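A minimal sketch of the idea discussed above (make the SPARQL endpoint configurable instead of hardcoding the public hostname), written in Python for illustration only. The `SparqlSettings` class and `build_query_url` function are hypothetical names and are not part of MediaWiki, WikimediaCampaignEvents, or WikibaseQualityConstraints. The internal endpoint value is the one quoted in the log for `$wgWBQualityConstraintsSparqlEndpoint`; whether it is the right target for this feature is something the search team would need to confirm.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Public WDQS endpoint. Per the log, connecting to query.wikidata.org:443
# from production timed out, while it worked in local development.
PUBLIC_ENDPOINT = "https://query.wikidata.org/sparql"

# Internal endpoint quoted in the log for $wgWBQualityConstraintsSparqlEndpoint.
# Assumption: used here purely as an example of an overridden value.
INTERNAL_ENDPOINT = "http://localhost:6009/sparql"


@dataclass(frozen=True)
class SparqlSettings:
    """Hypothetical settings object holding the endpoint as configuration."""

    endpoint: str = PUBLIC_ENDPOINT


def build_query_url(settings: SparqlSettings, query: str) -> str:
    """Build a GET URL for a SPARQL query against the configured endpoint."""
    return f"{settings.endpoint}?{urlencode({'query': query})}"


# Local development keeps the default; production overrides it in config.
local_dev = SparqlSettings()
production = SparqlSettings(endpoint=INTERNAL_ENDPOINT)

print(build_query_url(production, "ASK {}"))
```

The point of the design is only that the hostname lives in configuration, so a deployment can change it without a code patch.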
[13:24:48] (03PS1) 10TrainBranchBot: Revert "Enable CampaignEvents collaboration list in testwiki and test2wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081973 [13:24:48] (03CR) 10TrainBranchBot: "lucaswerkmeister-wmde@deploy2002 created a revert of this change as I42b466f3fef9ae8f8a4f675a3294eac038fd8184" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081446 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:24:54] I’m reverting it now FYI [13:24:57] !log bking@stat1008.mgmt racadm>>racadm set BIOS.MemSettings.NodeInterleave Enabled T376813 [13:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:01] T376813: Implement non-cgroups-related performance optimizations on stat hosts - https://phabricator.wikimedia.org/T376813 [13:25:08] (03CR) 10Muehlenhoff: [C:04-1] "These templates also need to be removed, otherwise LGTM:" [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [13:25:09] !log bking@stat1008.mgmt racadm>>racadm jobqueue create BIOS.Setup.1-1 [13:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081973 (owner: 10TrainBranchBot) [13:25:55] (03Merged) 10jenkins-bot: Revert "Enable CampaignEvents collaboration list in testwiki and test2wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081973 (owner: 10TrainBranchBot) [13:26:11] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1081973|Revert "Enable CampaignEvents collaboration list in testwiki and test2wiki"]] [13:27:16] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:28:16] ah, I think WBQC can’t directly use SparqlClient at the moment because we need access to the 
response headers [13:28:31] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, trainbranchbot: Backport for [[gerrit:1081973|Revert "Enable CampaignEvents collaboration list in testwiki and test2wiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:36] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:28:41] @Daimona already on the new patch :) [13:28:42] (though it would probably be possible to make that work somehow [13:28:54] Daimona or HouseOfM: can you test that it’s disabled again on mwdebug? [13:29:19] HouseOfM: Thanks :) Lucas_WMDE: yep, it's gone for good [13:29:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, trainbranchbot: Continuing with sync [13:29:29] good, thanks for checking [13:29:44] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:30:50] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2172.codfw.wmnet onto db2240.codfw.wmnet [13:32:03] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:33:11] !log bking@stat1009,stat1010.mgmt racadm>>racadm set BIOS.MemSettings.NodeInterleave Enabled && racadm jobqueue create BIOS.Setup.1-1 T376813 [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:25] T376813: Implement non-cgroups-related performance optimizations on stat hosts - https://phabricator.wikimedia.org/T376813 [13:33:42] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:34:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081973|Revert "Enable CampaignEvents collaboration list in testwiki and test2wiki"]] (duration: 08m 20s) [13:34:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet [13:35:01] alright, that’s done [13:35:07] Tran: do you want to self-service your deployments or should I do them? 
[13:35:28] If you want to, I wouldn't complain but I can do it myself as well. There are a few of them; do you think there's enough time for them all? [13:35:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10246360 (10MoritzMuehlenhoff) [13:35:33] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10246357 (10jcrespo) 05Open→03Resolved a:03jcrespo Done: https://wikitech.wikimedia.org/wiki/I... [13:35:44] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:35:48] I was thinking they would be deployed together [13:35:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T376905)', diff saved to https://phabricator.wikimedia.org/P70376 and previous config saved to /var/cache/conftool/dbconfig/20241021-133552-ladsgroup.json [13:35:54] up to you whether that makes sense or not ^^ [13:35:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [13:36:00] jouncebot: next [13:36:01] In 1 hour(s) and 53 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1530) [13:36:06] also we have a bit of time after the window [13:36:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [13:36:15] I think 2 batches then - one for the AbuseFilter change and then the stack of Special:GlobalContributions changes? 
[13:36:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70377 and previous config saved to /var/cache/conftool/dbconfig/20241021-133619-ladsgroup.json [13:38:43] Okay, I can do the AbuseFilter one but might need some help with the 3 stack after. [13:39:38] looks like I briefly dropped out of IRC there [13:39:50] Tran: sounds good to me, feel free to go ahead [13:40:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:40:48] (03CR) 10Brouberol: [C:03+1] Datahub: disable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081948 (https://phabricator.wikimedia.org/T376657) (owner: 10Btullis) [13:40:54] deploying multiple changes with scap backport is simple, just call it with multiple URLs as arguments ^^ [13:41:08] (03Merged) 10jenkins-bot: Apply wmf-specific protected vars rights access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [13:41:17] you might have to rebase the changes into one chain, not sure [13:41:21] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1080250|Apply wmf-specific protected vars rights access (T369610)]] [13:41:35] T369610: Decide who gets the `abusefilter-access-protected-vars` right - https://phabricator.wikimedia.org/T369610 [13:41:50] scap backport should tell you if there’s something wrong with them so it’s probably okay to just try `scap backport 1081415 1081138 1080227` first [13:43:28] !log stran@deploy2002 stran: Backport for [[gerrit:1080250|Apply wmf-specific protected vars rights access (T369610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:43:39] Nice I didn't know you could backport multiple like that. TIL. 
I guess we'll find out soon enough if it'll Just Work. [13:43:41] (03PS5) 10Elukey: [DO-NOT-MERGE] sre.hosts.provision: upload the Redfish license [cookbooks] - 10https://gerrit.wikimedia.org/r/1076975 [13:44:22] "you might have to rebase the changes into one chain, not sure" ==> it's helpful. AFAIK, jenkins should rebase conflict-less patches in the config repo at this point (it didn't do that in the past), but having them rebased on top of each other makes things more predictable. [13:44:34] *nods* [13:45:01] that said, if CI fails, you can fix it manually and scap will wait for you to do that, so it's not a problem either way [13:45:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70378 and previous config saved to /var/cache/conftool/dbconfig/20241021-134521-ladsgroup.json [13:45:34] Confirmed my groups are as expected with debug, continuing with the AbuseFilter config backport [13:45:39] !log stran@deploy2002 stran: Continuing with sync [13:48:17] (03CR) 10Brouberol: "Ah nice! I'll abandon this one then :)" [puppet] - 10https://gerrit.wikimedia.org/r/1079542 (https://phabricator.wikimedia.org/T327259) (owner: 10Brouberol) [13:48:20] (03Abandoned) 10Brouberol: envoy: Fix firewall_srange not being taken into account [puppet] - 10https://gerrit.wikimedia.org/r/1079542 (https://phabricator.wikimedia.org/T327259) (owner: 10Brouberol) [13:49:10] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10246391 (10ssingh) Hi @RobH: Thanks for writing the instructions in the doc above. The hostnames and other information looks good. Is there any thing pen... 
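The multi-change invocation suggested above (`scap backport 1081415 1081138 1080227`) can be assembled from a mix of Gerrit URLs and bare change numbers. The following is a small sketch under stated assumptions: `change_number` and `backport_command` are hypothetical helpers, not part of scap, and they only build the command string without running anything. The log states that `scap backport` accepts several URLs as arguments and shows it being suggested with bare change numbers; the helper normalises to numbers purely for readability.

```python
import re
import shlex

# Matches short Gerrit change URLs as they appear in the log,
# e.g. https://gerrit.wikimedia.org/r/1081415
CHANGE_URL = re.compile(r"^https://gerrit\.wikimedia\.org/r/(\d+)/?$")


def change_number(ref: str) -> str:
    """Return the change number for a bare number or a short Gerrit URL."""
    if ref.isdigit():
        return ref
    match = CHANGE_URL.match(ref)
    if match is None:
        raise ValueError(f"not a recognised Gerrit change reference: {ref!r}")
    return match.group(1)


def backport_command(refs: list[str]) -> str:
    """Build one `scap backport` command line covering every given change."""
    numbers = [change_number(ref) for ref in refs]
    return shlex.join(["scap", "backport", *numbers])


print(backport_command([
    "https://gerrit.wikimedia.org/r/1081415",
    "1081138",
    "https://gerrit.wikimedia.org/r/1080227",
]))
```

As noted in the discussion, rebasing the changes into one chain beforehand makes the result more predictable, and scap waits if CI fails so the conflict can be fixed by hand.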
[13:49:51] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 13Patch-Needs-Improvement: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235#10246393 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF We're long past version 6.6 [13:50:15] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080250|Apply wmf-specific protected vars rights access (T369610)]] (duration: 08m 53s) [13:50:21] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Kryo memcached transcoder broken in CAS 6.3/6.4 - https://phabricator.wikimedia.org/T273867#10246398 (10SLyngshede-WMF) 05Open→03Invalid Moving ticket registry to Redis. [13:50:32] T369610: Decide who gets the `abusefilter-access-protected-vars` right - https://phabricator.wikimedia.org/T369610 [13:50:42] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:52:20] (03CR) 10Btullis: airflow: define a cephfs PVC storing the DAGs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:53:04] Starting the Special:GC-related backport, we'll see if it needs a conflict resolution [13:53:21] ok! 
[13:53:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [13:53:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: 10STran) [13:53:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 10Kosta Harlan) [13:53:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:54:24] (03CR) 10Brouberol: airflow: define a cephfs PVC storing the DAGs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:54:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:55:11] (03Merged) 10jenkins-bot: Disable IP reveal rights for local metawiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [13:55:13] (03Merged) 10jenkins-bot: Set redirect wiki for Special:GlobalContributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: 10STran) [13:55:15] (03Merged) 10jenkins-bot: temp accounts: Make temp accounts known on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080227 (https://phabricator.wikimedia.org/T376132) (owner: 10Kosta Harlan) [13:55:33] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1081415|Disable IP reveal rights for local metawiki groups (T377584)]], 
[[gerrit:1081138|Set redirect wiki for Special:GlobalContributions (T376612)]], [[gerrit:1080227|temp accounts: Make temp accounts known on metawiki (T376132)]] [13:56:03] T377584: Temporary restrict local access to Special:GlobalContributions - https://phabricator.wikimedia.org/T377584 [13:56:03] T376612: Implement Global Contributions as a central page on Meta and implement redirects from other projects - https://phabricator.wikimedia.org/T376612 [13:56:04] T376132: Set known flag for temporary accounts config on metawiki - https://phabricator.wikimedia.org/T376132 [13:56:29] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:57:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2035.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:57:40] !log stran@deploy2002 stran, kharlan: Backport for [[gerrit:1081415|Disable IP reveal rights for local metawiki groups (T377584)]], [[gerrit:1081138|Set redirect wiki for Special:GlobalContributions (T376612)]], [[gerrit:1080227|temp accounts: Make temp accounts known on metawiki (T376132)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P70379 and previous config saved to /var/cache/conftool/dbconfig/20241021-140028-ladsgroup.json [14:00:29] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10246458 (10elukey) [14:01:04] jouncebot: nowandnext [14:01:04] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [14:01:04] In 1 hour(s) and 28 minute(s): Wikimedia Portals Update 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1530) [14:01:13] Dreamy_Jazz: Tran is still deploying [14:01:20] Thanks! [14:01:29] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:35] Yet to write my config patch, so can wait :D [14:01:46] :D [14:01:52] (03CR) 10Herron: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1081900 (https://phabricator.wikimedia.org/T377433) (owner: 10Máté Szabó) [14:03:04] (03PS1) 10Slyngshede: Allow CAS to have Redis supported enabled in overlay. [puppet] - 10https://gerrit.wikimedia.org/r/1081980 (https://phabricator.wikimedia.org/T377728) [14:05:10] Sorry, was away testing the changes. Changes look as expected (permissions changes to temp account IP access). The redirect isn't directly testable since it's been deployed in advance of the change that will use the config. [14:05:34] Going to continue with the backport [14:05:37] (03CR) 10Btullis: airflow: define a cephfs PVC storing the DAGs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [14:05:47] !log stran@deploy2002 stran, kharlan: Continuing with sync [14:07:18] sounds good [14:07:26] (03PS1) 10Dreamy Jazz: [beta] Enable temporary accounts on all wikis but en-rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081981 (https://phabricator.wikimedia.org/T377262) [14:07:59] My config change is beta only. 
[14:09:29] (03CR) 10Jgiannelos: [C:03+2] changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [14:10:28] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081415|Disable IP reveal rights for local metawiki groups (T377584)]], [[gerrit:1081138|Set redirect wiki for Special:GlobalContributions (T376612)]], [[gerrit:1080227|temp accounts: Make temp accounts known on metawiki (T376132)]] (duration: 14m 55s) [14:10:37] (03Merged) 10jenkins-bot: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [14:10:55] T377584: Temporary restrict local access to Special:GlobalContributions - https://phabricator.wikimedia.org/T377584 [14:10:57] T376612: Implement Global Contributions as a central page on Meta and implement redirects from other projects - https://phabricator.wikimedia.org/T376612 [14:10:57] T376132: Set known flag for temporary accounts config on metawiki - https://phabricator.wikimedia.org/T376132 [14:12:15] I should be done. Thanks! [14:12:59] \o/ [14:13:12] Dreamy_Jazz: want to deploy* your change now? ^^ [14:13:17] Sure. 
[14:13:21] (03CR) 10Dreamy Jazz: [C:03+2] [beta] Enable temporary accounts on all wikis but en-rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081981 (https://phabricator.wikimedia.org/T377262) (owner: 10Dreamy Jazz) [14:13:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081981 (https://phabricator.wikimedia.org/T377262) (owner: 10Dreamy Jazz) [14:14:16] (03Merged) 10jenkins-bot: [beta] Enable temporary accounts on all wikis but en-rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081981 (https://phabricator.wikimedia.org/T377262) (owner: 10Dreamy Jazz) [14:14:42] Done, considering that it was a beta-only change. [14:15:09] Lucas_WMDE: Did you want to deploy after me? If so, I'm finished. [14:15:15] nope [14:15:20] !log UTC afternoon backport+config window done [14:15:21] just that ^^ [14:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] :D [14:15:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P70380 and previous config saved to /var/cache/conftool/dbconfig/20241021-141535-ladsgroup.json [14:17:51] (03CR) 10Ssingh: "Looks good and some really fancy use of reduce! 
Some basic questions and nits:" [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [14:18:44] (03PS4) 10Arturo Borrero Gonzalez: openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) [14:19:18] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [14:19:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [14:19:59] (03CR) 10Herron: [C:03+2] jaeger: bump to 1.62-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081181 (https://phabricator.wikimedia.org/T376904) (owner: 10Herron) [14:21:08] (03Merged) 10jenkins-bot: jaeger: bump to 1.62-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081181 (https://phabricator.wikimedia.org/T376904) (owner: 10Herron) [14:21:48] (03PS7) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [14:23:04] (03PS8) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [14:23:08] (03CR) 10Brouberol: airflow: define a cephfs PVC storing the DAGs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [14:25:37] (03PS9) 10Brouberol: airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) [14:25:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation 
and configuration - https://phabricator.wikimedia.org/T377381#10246585 (10Jgreen) @cmooney Dallas did a survey of existing servers at eqiad, and none of them have 10G interfaces. So 10G por... [14:26:15] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: remove wikitech profiles and base modules [puppet] - 10https://gerrit.wikimedia.org/r/1081968 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [14:26:35] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10246587 (10Jelto) Thanks for raising this issue and uploading the patches. I deployed the new `index.html` for static-... [14:28:23] (03PS2) 10RLazarus: deployment_server: mwscript-k8s logging cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1081265 (https://phabricator.wikimedia.org/T377292) [14:28:23] (03PS1) 10RLazarus: deployment_server: Refactor mwscript-k8s preparatory to adding --output [puppet] - 10https://gerrit.wikimedia.org/r/1081985 (https://phabricator.wikimedia.org/T377292) [14:28:24] (03PS1) 10RLazarus: deployment_server: Add JSON output mode to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1081986 (https://phabricator.wikimedia.org/T377292) [14:29:02] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [14:29:17] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10246599 (10Reedy) Thanks! I was still seeing the old file on https://static-codereview.wikimedia.org/, so just pushed... 
[14:29:26] !log installing PHP 8.2 security updates [14:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T376905)', diff saved to https://phabricator.wikimedia.org/P70381 and previous config saved to /var/cache/conftool/dbconfig/20241021-143042-ladsgroup.json [14:30:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [14:31:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [14:31:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70382 and previous config saved to /var/cache/conftool/dbconfig/20241021-143108-ladsgroup.json [14:31:39] !log herron@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:31:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10246614 (10cmooney) >>! In T377381#10246582, @Jgreen wrote: > @cmooney Dallas did a survey of existing servers at eqiad, and n... 
[14:32:36] !log herron@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:33:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10246616 (10cmooney) [14:34:13] (03PS1) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) [14:34:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [14:34:54] (03PS1) 10STran: Disable local IP view right group on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) [14:35:13] (03CR) 10Brouberol: [C:03+2] airflow: define a cephfs PVC storing the DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081922 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [14:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70383 and previous config saved to /var/cache/conftool/dbconfig/20241021-143818-ladsgroup.json [14:39:05] (03PS2) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) [14:39:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [14:40:40] 06SRE, 10conftool, 06Data-Persistence, 
06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10246685 (10Volans) Now that we have dbctl support in Spicerack it should be doable to add the step to modify the IP in dbctl when needed. [14:42:20] 10SRE-tools, 06Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593#10246698 (10Volans) a:05Volans→03None De-assigning from me as I've not worked on it or plan to do so soon. [14:44:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485#10246717 (10Volans) a:05Volans→03None [14:46:28] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1081969 (owner: 10EoghanGaffney) [14:46:48] (03CR) 10Dreamy Jazz: [C:03+1] Disable local IP view right group on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [14:47:09] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774#10246712 (10Volans) 05Open→03Resolved Tentatively resolving, I can't repro it anymore. If you encounter the same issue please re-open it. [14:48:47] 👋 I'm back. Is it possible for me to backport one more config patch? I realized the combination of patches I backported caused a state not covered by those patches. 
[14:48:53] (03CR) 10EoghanGaffney: [C:03+2] lists: Fix ATS backend map target for lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1081969 (owner: 10EoghanGaffney) [14:49:15] jouncebot: now [14:49:15] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [14:49:23] Tran: I think if nobody objects within a few minutes you can go ahead [14:49:45] * Lucas_WMDE peeks at scap locks [14:49:58] (03CR) 10Elukey: [C:04-1] "Tested https://python-semver.readthedocs.io/en/latest/usage/parse-version-string.html and it worked nicely :)" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 (owner: 10Elukey) [14:49:59] `cat /var/lock/scap*` looks empty too [14:51:23] (03PS3) 10Elukey: registry: expand the HTTP Accept headers and drop v1 compatibility [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 [14:52:15] (03PS4) 10Elukey: registry: expand the HTTP Accept headers and drop v1 compatibility [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 [14:53:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P70384 and previous config saved to /var/cache/conftool/dbconfig/20241021-145325-ladsgroup.json [14:53:25] (03PS2) 10STran: Disable local IP view right group on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) [14:55:07] (03CR) 10Alexandros Kosiaris: [C:03+1] "Done" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 (owner: 10Elukey) [14:55:45] (03CR) 10Dreamy Jazz: [C:03+1] "We want to avoid causing auto-promotion into a group with no rights, given that it would likely cause the group to not appear at `Special:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [14:56:43] Thanks for checking for me! Just got the +1 from my teammate. 
If no one has any objections, I'll start my backport. [14:57:05] 06SRE, 10decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741 (10MoritzMuehlenhoff) 03NEW [14:57:14] 06SRE, 10decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10246812 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [14:57:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [14:58:22] (03CR) 10Elukey: "@akosiaris@wikimedia.org: re: bundle the two together - I had to do it since afaics there is no way to get the old "history" with the new " [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 (owner: 10Elukey) [14:58:50] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10246815 (10Jelto) 05Open→03Resolved Yes I can confirm https://static-codereview.wikimedia.org/ shows the new c... 
[14:59:18] (03Merged) 10jenkins-bot: Disable local IP view right group on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081988 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [14:59:34] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1081988|Disable local IP view right group on meta (T377584)]] [14:59:54] T377584: Temporarily restrict local access to Special:GlobalContributions - https://phabricator.wikimedia.org/T377584 [15:01:41] (03PS1) 10Dreamy Jazz: [beta] Enable global autoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081990 (https://phabricator.wikimedia.org/T377737) [15:01:49] !log stran@deploy2002 stran: Backport for [[gerrit:1081988|Disable local IP view right group on meta (T377584)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:42] Group permissions look like I expect them to, continuing with the deploy [15:02:45] !log stran@deploy2002 stran: Continuing with sync [15:03:41] (03CR) 10Dreamy Jazz: [C:03+2] [beta] Enable global autoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081990 (https://phabricator.wikimedia.org/T377737) (owner: 10Dreamy Jazz) [15:04:25] (03Merged) 10jenkins-bot: [beta] Enable global autoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081990 (https://phabricator.wikimedia.org/T377737) (owner: 10Dreamy Jazz) [15:04:41] Just merged a beta only config change. [15:05:10] Will run scap backport to update the git repo once the backport in progress is complete. 
[15:06:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [15:08:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P70385 and previous config saved to /var/cache/conftool/dbconfig/20241021-150832-ladsgroup.json [15:10:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10246886 (10Jgreen) > That's a bit of a shame in some ways but no problems We'll get there next procurement cycle! Note that w... [15:12:37] (03CR) 10Vgutierrez: varnish: Give 1% of views RSA cert warnings [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [15:14:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [15:17:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:17:46] There's a timeout that seems to be causing one of the hosts to fail: `ssh: connect to host cloudweb2002-dev.wikimedia.org port 22: Connection timed out` [15:19:08] (03CR) 10Majavah: P:toolforge::proxy: use svc.toolforge.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1080056 (owner: 10Majavah) [15:20:03] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081988|Disable local IP view right group on meta (T377584)]] (duration: 20m 29s) [15:20:25] !log rearm keyholder on netmon2002 [15:20:32] T377584: Temporarily restrict local access to Special:GlobalContributions - https://phabricator.wikimedia.org/T377584 [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on 
netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [15:22:29] Well, that wasn't successful. What should I do in this case? I got the error `15:20:03 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=stran', 'Backport for [[gerrit:1081988|Disable local IP view right group on meta (T377584)]]']' returned non-zero exit status 1. (scap version: 4.114.0)` so I assume the backport didn't finish but I don't know what to do from this [15:22:30] state. [15:23:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T376905)', diff saved to https://phabricator.wikimedia.org/P70386 and previous config saved to /var/cache/conftool/dbconfig/20241021-152339-ladsgroup.json [15:23:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [15:24:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [15:24:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70387 and previous config saved to /var/cache/conftool/dbconfig/20241021-152408-ladsgroup.json [15:25:50] I think that usually means that one host failed but the others are fine [15:26:00] did the “big” k8s deployment finish successfully? [15:26:59] oh, cloudweb2002-dev seems to be Kubernetes-related? (judging by https://phabricator.wikimedia.org/T292707#10196729) [15:27:05] I don't think so? 
I think that last host was part of it [15:27:24] hm, ok [15:28:00] https://www.irccloud.com/pastebin/yIlK1wmT/ [15:28:05] (03PS10) 10Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) [15:28:24] yeah I think that means sync-prod-k8s was fine [15:28:34] (03CR) 10Volans: "Updated based on DRY-RUN testing" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [15:28:43] (03PS1) 10Joely Rooke WMDE: Activate feature flag to default move wikibase sidebar link to other projects. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081995 (https://phabricator.wikimedia.org/T66315) [15:28:59] * Lucas_WMDE apparently lacks permission to SSH into cloudweb2002-dev directly [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1530). 
[15:30:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081995 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [15:30:25] I see `1 apaches had sync errors`, `1 hosts had scap-cdb-rebuild errors`, `1 hosts had sync_wikiversions errors` https://www.irccloud.com/pastebin/peLkZPkh/ [15:30:33] presumably that one host [15:30:54] looks like that one host is unresponsive [15:31:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70388 and previous config saved to /var/cache/conftool/dbconfig/20241021-153113-ladsgroup.json [15:32:12] I’m hoping someone else here knows what’s up with cloudweb2002-dev tbh [15:32:19] if not, might be worth asking over in #wikimedia-sre [15:32:34] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] ssh: connect to host cloudweb2002-dev.wikimedia.org port 22: Connection timed out [15:33:46] (from deploy2002) [15:33:53] Sure, is it a problem that it's now the portals update window and this is happening? [15:34:12] hey Lucas_WMDE, I did a change to cloudweb2002-dev today [15:34:17] hi :) [15:34:31] it was showing errors during Tran’s scap backport [15:34:52] what I did was to merge this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081968 [15:35:20] basically, drop everything related to mediawiki from the host, since it no longer runs it [15:35:53] does that mean scap should no longer be trying to connect to it for a scap pull? [15:36:11] I guess so? is there a list somewhere where scap may be configured to reach to that host? 
[15:36:24] I guess so, but I have no idea where it lives [15:36:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [15:36:42] effie: are you around to assist with this? [15:37:04] https://gerrit.wikimedia.org/g/operations/puppet/+/94fd29466c/hieradata/common/scap/dsh.yaml#28 looks like a potential candidate [15:37:29] two other cloudwebs were removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1077008/4/hieradata/common/scap/dsh.yaml seemingly [15:37:38] we're in a meeting, remove it from the dsh list [15:37:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2172.codfw.wmnet onto db2240.codfw.wmnet [15:37:55] claime: ok, will do [15:37:59] thanks! [15:39:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [15:39:38] !log Starting MediaModeration scanning script for 12 hrs on enwiki - https://wikitech.wikimedia.org/wiki/MediaModeration [15:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:57] (03PS1) 10Arturo Borrero Gonzalez: scap: dsh: remove cloudweb2002-dev from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1081997 (https://phabricator.wikimedia.org/T371378) [15:40:14] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1081997 is the patch [15:40:25] cc claime ^^^ [15:41:16] (03CR) 10Clément Goubert: [C:03+1] scap: dsh: remove cloudweb2002-dev from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1081997 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [15:41:27] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] scap: dsh: remove cloudweb2002-dev from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/1081997 (https://phabricator.wikimedia.org/T371378) (owner: 10Arturo Borrero Gonzalez) [15:41:53] arturo: thanks! 
[15:42:36] (03PS1) 10Cwhite: logstash: drop datahub-upgrade-job logs [puppet] - 10https://gerrit.wikimedia.org/r/1081998 (https://phabricator.wikimedia.org/T363856) [15:42:41] kick off a puppet run on the deployment server and it should be fine [15:42:47] Thank you friends 🙇 I guess this means my backport was successful? Since that host wasn't meant to be reached anyway. [15:43:44] running puppet agent on deploy1003/2002 [15:44:00] Tran: did it proceed with the k8s steps afterwards and all that? [15:44:44] (03CR) 10Cwhite: [C:03+2] logstash: drop datahub-upgrade-job logs [puppet] - 10https://gerrit.wikimedia.org/r/1081998 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite) [15:44:57] I think so, here are the full logs of the sync. I see 100% on the K8s deployment progress. https://www.irccloud.com/pastebin/jLypu4md/ [15:46:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P70389 and previous config saved to /var/cache/conftool/dbconfig/20241021-154620-ladsgroup.json [15:46:24] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10247069 (10RobH) I still need to finish up the directions and will ping everyone to proof them again, but with all the ordering this week I just got a bi... [15:46:25] yeah should be fine [15:46:33] great, thank you! [15:47:01] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10247070 (10Papaul) [15:47:22] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10247072 (10ssingh) Thanks @RobH. Please let us know if Traffic's input is required or if we can help. 
[15:48:37] (03CR) 10Sergio Gimeno: [Growth] beta: configure the A/B test experiment variants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [15:52:48] puppet agent did the expected change on the deploy servers [15:53:14] !log volans@cumin1002 START - Cookbook sre.mysql.depool db1185 - Testing new cookbook [15:53:51] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1185 - Testing new cookbook [15:55:04] (03PS1) 10Gmodena: airflow: analytics: alert only on task failure [puppet] - 10https://gerrit.wikimedia.org/r/1082001 (https://phabricator.wikimedia.org/T377745) [15:55:32] !log volans@cumin1002 START - Cookbook sre.mysql.pool db1185 gradually with 4 steps - Testing new cookbook [15:55:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:56:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:58:56] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 gradually with 4 steps - Testing new cookbook [16:01:23] (03PS3) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) [16:01:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P70395 and previous config saved to /var/cache/conftool/dbconfig/20241021-160127-ladsgroup.json [16:01:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [16:03:06] (03PS4) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 
(https://phabricator.wikimedia.org/T377734) [16:04:30] arturo: I was in a meeting, sorry [16:04:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [16:04:59] (03PS1) 10Gmodena: analytics: refine: set smtp for refine job. [puppet] - 10https://gerrit.wikimedia.org/r/1082003 (https://phabricator.wikimedia.org/T377698) [16:06:24] (03CR) 10Ottomata: [C:03+1] analytics: refine: set smtp for refine job. [puppet] - 10https://gerrit.wikimedia.org/r/1082003 (https://phabricator.wikimedia.org/T377698) (owner: 10Gmodena) [16:07:02] (03PS5) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) [16:11:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70396 and previous config saved to /var/cache/conftool/dbconfig/20241021-161108-arnaudb.json [16:11:15] (03PS1) 10Herron: jaeger: bump jaeger-builder version to 1.62.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082006 (https://phabricator.wikimedia.org/T376904) [16:12:38] (03CR) 10Ssingh: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:13:10] (03CR) 10Ssingh: "We should plan on getting this merged this week. 
If we can address the nit, that would be great IMO (unless you feel there is no need) but" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:15:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:16:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T376905)', diff saved to https://phabricator.wikimedia.org/P70397 and previous config saved to /var/cache/conftool/dbconfig/20241021-161634-ladsgroup.json [16:16:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [16:16:45] (03CR) 10CDanis: [C:03+1] jaeger: bump jaeger-builder version to 1.62.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082006 (https://phabricator.wikimedia.org/T376904) (owner: 10Herron) [16:16:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [16:17:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T376905)', diff saved to https://phabricator.wikimedia.org/P70398 and previous config saved to /var/cache/conftool/dbconfig/20241021-161701-ladsgroup.json [16:17:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:17:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:18:10] !log volans@cumin1002 START - Cookbook sre.mysql.depool db1185 - Testing new cookbook [16:18:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:19:10] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:19:23] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:20:02] (03CR) 10Herron: [V:03+2 C:03+2] jaeger: bump jaeger-builder version to 1.62.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082006 (https://phabricator.wikimedia.org/T376904) (owner: 10Herron) [16:21:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10247233 (10phaultfinder) [16:21:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:21:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:22:26] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:36] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:54] !log volans@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db1185 - Testing new cookbook [16:25:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T376905)', diff saved to https://phabricator.wikimedia.org/P70399 and previous config saved to /var/cache/conftool/dbconfig/20241021-162525-ladsgroup.json [16:25:35] !log volans@cumin1002 START - Cookbook sre.mysql.depool db1185 - Testing new cookbook [16:25:56] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1185 - Testing new cookbook [16:26:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: post clone', diff saved to https://phabricator.wikimedia.org/P70401 and previous config saved to /var/cache/conftool/dbconfig/20241021-162613-arnaudb.json [16:26:17] (03CR) 10Btullis: [C:03+2] analytics: refine: set smtp for refine job. 
[puppet] - 10https://gerrit.wikimedia.org/r/1082003 (https://phabricator.wikimedia.org/T377698) (owner: 10Gmodena) [16:27:25] !log volans@cumin1002 START - Cookbook sre.mysql.pool db1185 quickly with 2 steps - Testing new cookbook [16:28:15] !log volans@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1185 quickly with 2 steps - Testing new cookbook [16:29:18] !log volans@cumin1002 START - Cookbook sre.mysql.pool db1185 quickly with 2 steps - Testing new cookbook [16:32:03] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 quickly with 2 steps - Testing new cookbook [16:33:56] !log volans@cumin1002 dbctl commit (dc=all): 'Fix db1185 weight', diff saved to https://phabricator.wikimedia.org/P70404 and previous config saved to /var/cache/conftool/dbconfig/20241021-163355-volans.json [16:36:59] (03PS1) 10Herron: jaeger: fix collector/query changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082039 [16:37:23] (03CR) 10Herron: [V:03+2 C:03+2] jaeger: fix collector/query changelog entries [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082039 (owner: 10Herron) [16:38:30] (03CR) 10Ssingh: "We are at around 0.09% of RSA usage so it's time to merge this! 
Just one question about the timing in the warning page but otherwise looks" [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [16:38:50] (03CR) 10Xcollazo: [C:03+1] airflow: analytics: alert only on task failure [puppet] - 10https://gerrit.wikimedia.org/r/1082001 (https://phabricator.wikimedia.org/T377745) (owner: 10Gmodena) [16:38:54] (03PS1) 10Herron: jaeger: bump to 1.62-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082040 [16:40:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P70405 and previous config saved to /var/cache/conftool/dbconfig/20241021-164032-ladsgroup.json [16:41:13] (03CR) 10Herron: [C:03+2] jaeger: bump to 1.62-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082040 (owner: 10Herron) [16:41:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247359 (10elukey) I tried a new firmware but it didn't work, same error. I noticed that the hosts showing the issue are the same exact model (https:/... 
[16:41:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: post clone', diff saved to https://phabricator.wikimedia.org/P70406 and previous config saved to /var/cache/conftool/dbconfig/20241021-164119-arnaudb.json [16:42:20] (03Merged) 10jenkins-bot: jaeger: bump to 1.62-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082040 (owner: 10Herron) [16:43:46] !log herron@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [16:44:14] !log herron@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [16:54:21] (03PS4) 10Fabfur: haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [16:55:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P70407 and previous config saved to /var/cache/conftool/dbconfig/20241021-165539-ladsgroup.json [16:56:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: post clone', diff saved to https://phabricator.wikimedia.org/P70408 and previous config saved to /var/cache/conftool/dbconfig/20241021-165624-arnaudb.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T1700). 
[17:02:13] (03CR) 10Scott French: [C:03+2] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [17:10:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T376905)', diff saved to https://phabricator.wikimedia.org/P70409 and previous config saved to /var/cache/conftool/dbconfig/20241021-171046-ladsgroup.json [17:11:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:11:25] (03PS4) 10Dzahn: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) [17:11:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:11:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T376905)', diff saved to https://phabricator.wikimedia.org/P70410 and previous config saved to /var/cache/conftool/dbconfig/20241021-171138-ladsgroup.json [17:14:21] (03PS1) 10Hnowlan: sessionstore: use service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) [17:16:05] (03CR) 10Dzahn: "still getting Error: Could not find resource 'File[/etc/ferm/conf.d]' in parameter 'require'" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:16:19] (03PS6) 10Muehlenhoff: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:18:36] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1071927/4326/people1004.eqiad.wmnet/change.people1004.eqiad.wmnet.err" [puppet] - 
10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:19:02] (03CR) 10Dzahn: [V:04-1 C:04-1] peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10247619 (10phaultfinder) [17:20:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T376905)', diff saved to https://phabricator.wikimedia.org/P70411 and previous config saved to /var/cache/conftool/dbconfig/20241021-172051-ladsgroup.json [17:22:35] (03PS4) 10Dzahn: gerrit: delete temp gerrit setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074488 [17:23:14] (03PS5) 10Dzahn: gerrit: delete temp gerrit setup role [puppet] - 10https://gerrit.wikimedia.org/r/1074488 (https://phabricator.wikimedia.org/T372804) [17:24:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "no diff in prod now: https://puppet-compiler.wmflabs.org/output/1074488/4327/" [puppet] - 10https://gerrit.wikimedia.org/r/1074488 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:25:28] (03PS3) 10Scott French: service: move mw-web-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1080789 (https://phabricator.wikimedia.org/T377040) [17:25:29] (03PS3) 10Scott French: service: move mw-web-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1080790 (https://phabricator.wikimedia.org/T377040) [17:25:31] (03PS1) 10Scott French: service: move mw-api-ext-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1082050 (https://phabricator.wikimedia.org/T377040) [17:25:32] (03PS1) 10Scott French: service: move mw-api-ext-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1082051 (https://phabricator.wikimedia.org/T377040) [17:27:01] 06SRE, 06collaboration-services, 13Patch-For-Review: setup 
gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10247685 (10Dzahn) 05Stalled→03In progress [17:30:58] (03CR) 10Scott French: [C:03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [17:35:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P70412 and previous config saved to /var/cache/conftool/dbconfig/20241021-173558-ladsgroup.json [17:36:47] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10247736 (10Dzahn) LFS data syncing from the prod servers is also already setup and happened via the puppetized timer: ` root@gerrit2003:/srv/gerrit/data/lfs# du -hs . 3... [17:38:19] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10247729 (10Dzahn) 05In progress→03Resolved Discussed in today's team meeting. This server is up and running with the production role on it now, so the task to setup... 
[17:38:29] (03CR) 10Alexandros Kosiaris: [C:03+1] service: move mw-web-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1080789 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:38:44] (03CR) 10Alexandros Kosiaris: [C:03+1] service: move mw-api-ext-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1082050 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:41:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:42:21] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:43:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:44:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:45:06] (03CR) 10Dzahn: "I am attaching the service IP/name that already existed from the past to this newer instance. Not creating it from scratch. And won't be un" [puppet] - 10https://gerrit.wikimedia.org/r/1081286 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [17:48:46] (03CR) 10Dzahn: [C:03+2] cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org. [puppet] - 10https://gerrit.wikimedia.org/r/1081286 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [17:48:53] (03PS1) 10Gergő Tisza: fix(AuthManagerStatsd): counters require static set of labels [extensions/WikimediaEvents] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082058 (https://phabricator.wikimedia.org/T377476) [17:48:58] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@671896c]: Deploy T375402. 
[17:49:24] T375402: Tune Dumps 2.0 hourly ingestion jobs - https://phabricator.wikimedia.org/T375402 [17:49:33] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082049 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:50:02] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@671896c]: Deploy T375402. (duration: 01m 04s) [17:50:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082058 (https://phabricator.wikimedia.org/T377476) (owner: 10Gergő Tisza) [17:51:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P70413 and previous config saved to /var/cache/conftool/dbconfig/20241021-175105-ladsgroup.json [17:52:08] !log dduvall@deploy2002 Installing scap version "4.115.0" for 209 hosts [17:52:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247833 (10elukey) Made it work, I had to factory reset after the firmware upgrade to get the new default 'calvin' password to work correctly, plus al... 
[17:53:12] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [17:53:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm [17:53:27] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1012.eqiad.wmnet with OS bookworm [17:53:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm execu... [17:56:45] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [17:56:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm [17:57:48] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:02] !log ran disable-puppet on 'A:lvs and (A:eqiad or A:codfw)' - T377040 [17:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:33] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [17:59:52] (03CR) 10Scott French: [C:03+2] service: move mw-api-ext-next to lvs_setup [puppet] - 
10https://gerrit.wikimedia.org/r/1082050 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:03:01] (03PS1) 10Michael Große: eswiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) [18:03:40] (03PS2) 10Michael Große: frwiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) [18:04:42] !log ran and enabled puppet agent on 'A:lvs and A:eqiad' - T377040 [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:05:15] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T377040) [18:05:18] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [18:05:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [18:06:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T377040) [18:06:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T376905)', diff saved to https://phabricator.wikimedia.org/P70414 and previous config saved to /var/cache/conftool/dbconfig/20241021-180612-ladsgroup.json [18:06:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with 
reason: Maintenance [18:06:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [18:06:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:06:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:06:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T376905)', diff saved to https://phabricator.wikimedia.org/P70415 and previous config saved to /var/cache/conftool/dbconfig/20241021-180654-ladsgroup.json [18:09:42] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T377040) [18:11:09] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [18:14:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T376905)', diff saved to https://phabricator.wikimedia.org/P70416 and previous config saved to /var/cache/conftool/dbconfig/20241021-181410-ladsgroup.json [18:14:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1012.eqiad.wmnet with reason: host reimage [18:15:41] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T377040) [18:16:09] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:19:32] !log ran and enabled puppet agent on 'A:lvs and A:codfw' - T377040 [18:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:50] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T377040) 
[18:20:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T377040) [18:21:17] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:22:41] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T377040) [18:23:27] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T377040) [18:28:30] (03PS1) 10Dzahn: Revert "cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org." [puppet] - 10https://gerrit.wikimedia.org/r/1082066 [18:29:09] (03CR) 10Dzahn: [C:03+2] Revert "cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org." [puppet] - 10https://gerrit.wikimedia.org/r/1082066 (owner: 10Dzahn) [18:29:14] (03CR) 10Fabfur: profile: Provide a liberica profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [18:29:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P70417 and previous config saved to /var/cache/conftool/dbconfig/20241021-182916-ladsgroup.json [18:32:06] !log ran disable-puppet on 'A:lvs and (A:eqiad or A:codfw)' - T377040 [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:21] (03CR) 10Scott French: [C:03+2] service: move mw-web-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1080789 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:32:42] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:33:55] jouncebot: nowandnext [18:33:55] No deployments scheduled for the next 1 hour(s) and 26 minute(s) 
[18:33:55] In 1 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T2000) [18:34:12] (03CR) 10Zabe: [C:03+2] s4: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081463 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [18:34:56] (03Merged) 10jenkins-bot: s4: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081463 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [18:35:15] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1081463|s4: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [18:35:40] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [18:36:11] !log ran and enabled puppet agent on 'A:lvs and A:eqiad' - T377040 [18:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:11] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T377040) [18:37:18] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [18:42:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:1081463|s4: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:42:34] !log zabe@deploy2002 zabe: Continuing with sync [18:42:41] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [18:43:11] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T377040) [18:43:29] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki 
services - https://phabricator.wikimedia.org/T377040 [18:44:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P70418 and previous config saved to /var/cache/conftool/dbconfig/20241021-184424-ladsgroup.json [18:45:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [18:45:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1012.eqiad.wmnet with OS bookworm [18:45:36] zabe: do you have any other backports planned after this one? I'm in the middle of deploying some changes that are somewhat complex and probably shouldn't happen at the same time as a deployment [18:45:53] nope [18:46:03] you can do your stuff then [18:46:26] zabe: ack, I'll hold the moment until I see your deployment complete [18:46:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10247992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm compl... [18:48:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10248001 (10elukey) 05Open→03Resolved Tested the reimage as well, all works! @jcrespo thanks for the patience, green light for production :) [18:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10248009 (10phaultfinder) [18:51:25] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081463|s4: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 16m 09s) [18:51:27] swfrench-wmf: done [18:51:38] zabe: ack, thanks! 
[18:52:00] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T377040) [18:52:27] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [18:53:31] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:57:59] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T377040) [18:59:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T376905)', diff saved to https://phabricator.wikimedia.org/P70419 and previous config saved to /var/cache/conftool/dbconfig/20241021-185931-ladsgroup.json [18:59:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:59:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:59:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70420 and previous config saved to /var/cache/conftool/dbconfig/20241021-185957-ladsgroup.json [19:01:02] !log ran and enabled puppet agent on 'A:lvs and A:codfw' - T377040 [19:01:14] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T377040) [19:01:42] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T377040) [19:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:18] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:04:48] !log swfrench@cumin2002 
START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T377040) [19:07:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70421 and previous config saved to /var/cache/conftool/dbconfig/20241021-190712-ladsgroup.json [19:09:54] (03PS1) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:10:29] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:10:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T377040) [19:11:13] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:12:06] (03CR) 10Vgutierrez: profile: Provide a liberica profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [19:12:23] (03PS2) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:12:57] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:16:15] (03PS4) 10Scott French: service: move mw-web-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1080790 (https://phabricator.wikimedia.org/T377040) [19:16:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:03] (03PS2) 10Scott French: service: move mw-api-ext-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1082051 (https://phabricator.wikimedia.org/T377040) [19:22:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P70422 and previous config saved to /var/cache/conftool/dbconfig/20241021-192219-ladsgroup.json [19:22:48] (03PS2) 10Scott French: wmnet: add DYNA records for mw-(web|api-ext)-next [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) [19:23:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10248108 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced power [19:29:44] (03PS3) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:30:13] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-next,name=codfw [reason: preparing mw-web-next (a/p) for discovery - T377040] [19:30:21] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:30:41] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:31:06] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-next,name=codfw [reason: preparing mw-api-ext-next (a/p) for discovery - T377040] [19:33:04] (03PS4) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 
10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:33:38] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:34:33] (03CR) 10RLazarus: [C:03+1] service: move mw-web-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1080790 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:34:39] (03CR) 10RLazarus: [C:03+1] service: move mw-api-ext-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1082051 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:34:54] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@b75c4aa] (releasing): Deploying changes to MediaWiki branch and publish WMF single-version image job [19:34:59] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-next-ro,name=codfw [reason: preparing mw-web-next-ro (a/a) for discovery - T377040] [19:35:08] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-next-ro,name=eqiad [reason: preparing mw-web-next-ro (a/a) for discovery - T377040] [19:35:33] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@b75c4aa] (releasing): Deploying changes to MediaWiki branch and publish WMF single-version image job (duration: 01m 20s) [19:36:06] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:36:32] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-next-ro,name=codfw [reason: preparing mw-api-ext-next-ro (a/a) for discovery - T377040] [19:36:41] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-next-ro,name=eqiad [reason: preparing mw-api-ext-next-ro (a/a) for discovery - T377040] [19:37:26] !log ladsgroup@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P70423 and previous config saved to /var/cache/conftool/dbconfig/20241021-193726-ladsgroup.json [19:38:20] (03CR) 10Scott French: [C:03+2] service: move mw-web-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1080790 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:40:24] (03PS5) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:41:00] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:44:13] (03CR) 10Scott French: [C:03+2] service: move mw-api-ext-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1082051 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:44:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082078 [19:48:31] (03PS3) 10Scott French: wmnet: add DYNA records for mw-(web|api-ext)-next [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) [19:50:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1166 is not coming back online - https://phabricator.wikimedia.org/T377464#10248208 (10Jclark-ctr) looks like it was a failure on the 1g ports I have updated idrac firmware ` 04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit... 
[19:50:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1166 is not coming back online - https://phabricator.wikimedia.org/T377464#10248209 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:51:08] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10248216 (10Jclark-ctr) a:03VRiley-WMF looks like the mgmt ticket you had done just came right back T376764 @VRiley-WMF [19:52:14] (03PS6) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:52:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T376905)', diff saved to https://phabricator.wikimedia.org/P70424 and previous config saved to /var/cache/conftool/dbconfig/20241021-195233-ladsgroup.json [19:52:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:52:41] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10248228 (10Jclark-ctr) a:03Jclark-ctr [19:52:47] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:52:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:53:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T376905)', diff saved to https://phabricator.wikimedia.org/P70425 and previous config saved to /var/cache/conftool/dbconfig/20241021-195300-ladsgroup.json [19:53:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T377686#10248234 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate T362033 [19:55:42] (03PS7) 
10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:56:17] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [19:57:51] (03PS8) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [19:58:24] (03CR) 10CI reject: [V:04-1] statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T2000) [20:00:05] MatmaRex, tgr, and MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T376905)', diff saved to https://phabricator.wikimedia.org/P70426 and previous config saved to /var/cache/conftool/dbconfig/20241021-200015-ladsgroup.json [20:00:21] o/ [20:00:25] o/ [20:00:34] hi [20:00:39] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10248292 (10Peachey88) [20:01:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T377686#10248290 (10Peachey88) →14Duplicate dup:03T362033 [20:01:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1166 is not coming back online - https://phabricator.wikimedia.org/T377464#10248301 (10Ladsgroup) Thanks! [20:05:29] I can do the backports [20:06:20] (03CR) 10Ssingh: [C:03+1] wmnet: add DYNA records for mw-(web|api-ext)-next [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [20:06:42] (03CR) 10Ssingh: "+1; I think it's fine to merge this as one patch." [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [20:06:47] (03CR) 10Ssingh: [C:03+1] wmnet: add DYNA records for mw-(web|api-ext)-next [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [20:08:27] yay, thank you tgr|away! [20:08:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: 10Bartosz Dziewoński) [20:08:52] (03CR) 10CI reject: [V:04-1] Re-apply "Set special footer licence message for MediaWiki.org re. 
Help: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: 10Bartosz Dziewoński) [20:09:14] (03PS2) 10Bartosz Dziewoński: Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) [20:10:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: 10Bartosz Dziewoński) [20:11:02] (03Merged) 10jenkins-bot: Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: 10Bartosz Dziewoński) [20:11:19] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1081445|Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" (T301483)]] [20:11:39] T301483: Change the copyright warning for Mediawikiwiki's Help: namespace to CC0 - https://phabricator.wikimedia.org/T301483 [20:13:36] !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1081445|Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" (T301483)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [20:14:38] MatmaRex: do you want to test? 
[20:14:47] yep, testing [20:15:05] since apparently it did not work the last time it was tried (2 years ago) [20:15:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P70427 and previous config saved to /var/cache/conftool/dbconfig/20241021-201522-ladsgroup.json [20:16:00] but everything looks good to me now [20:16:07] tgr|away: good to go [20:16:23] !log tgr@deploy2002 matmarex, tgr: Continuing with sync [20:16:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:27] cool, thx [20:17:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [20:18:12] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082078 (owner: 10TrainBranchBot) [20:19:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [20:21:07] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081445|Re-apply "Set special footer licence message for MediaWiki.org re. 
Help: pages" (T301483)]] (duration: 09m 48s) [20:21:24] T301483: Change the copyright warning for Mediawikiwiki's Help: namespace to CC0 - https://phabricator.wikimedia.org/T301483 [20:22:30] (03PS1) 10Daimona Eaytoy: Enable CampaignEvents collaboration list in testwiki and test2wiki (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) [20:22:31] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [20:22:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [20:23:10] thanks tgr|away [20:23:33] (03Merged) 10jenkins-bot: frwiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082057 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [20:23:38] I think there is nothing to test for my change. 
We will see its effects in the Grafana logs over time [20:23:59] as long as there are no errors, we should be in the clear [20:24:16] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1082057|frwiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]] [20:24:19] (and we already toggled this flag for eswiki, so I'm not expecting anything) [20:24:23] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [20:25:35] 10SRE-Access-Requests: Give Dumps 1.0 access to gmodena - https://phabricator.wikimedia.org/T377773#10248358 (10Nemoralis) [20:26:30] !log tgr@deploy2002 migr, tgr: Backport for [[gerrit:1082057|frwiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:00] !log tgr@deploy2002 migr, tgr: Continuing with sync [20:29:53] (03PS9) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [20:30:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P70428 and previous config saved to /var/cache/conftool/dbconfig/20241021-203029-ladsgroup.json [20:30:38] (03PS2) 10Daimona Eaytoy: Enable CampaignEvents collaboration list in testwiki and test2wiki (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) [20:32:36] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082057|frwiki: switch clearing link recommendations to PageSaveComplete hook (T372337)]] (duration: 08m 19s) [20:33:00] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [20:33:13] (03CR) 10Volkanurl: "check experimental" [labs/private] - 
10https://gerrit.wikimedia.org/r/1072655 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [20:33:21] (03CR) 10Volkanurl: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1072655 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [20:33:27] (03PS2) 10CDanis: eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) [20:34:20] (03PS3) 10CDanis: eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) [20:34:26] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [20:34:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082058 (https://phabricator.wikimedia.org/T377476) (owner: 10Gergő Tisza) [20:37:40] (03Merged) 10jenkins-bot: fix(AuthManagerStatsd): counters require static set of labels [extensions/WikimediaEvents] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082058 (https://phabricator.wikimedia.org/T377476) (owner: 10Gergő Tisza) [20:37:57] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1082058|fix(AuthManagerStatsd): counters require static set of labels (T377476)]] [20:38:09] (03CR) 10Scott French: "Thanks for the review, Sukhbir!" 
[dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [20:38:17] T377476: PHP Warning: Stats: Cannot add labels to a metric containing samples for 'authmanager_error_total' - https://phabricator.wikimedia.org/T377476 [20:40:09] (03CR) 10Scott French: [C:03+2] wmnet: add DYNA records for mw-(web|api-ext)-next [dns] - 10https://gerrit.wikimedia.org/r/1080779 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [20:40:09] !log tgr@deploy2002 tgr: Backport for [[gerrit:1082058|fix(AuthManagerStatsd): counters require static set of labels (T377476)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:41:05] (03PS4) 10CDanis: eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) [20:41:06] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [20:44:03] tgr tgr|away I looked with mwdebug enabled at frwiki and did not notice any errors that would seem related to my change. [20:45:02] thanks MichaelG_WMF! 
[20:45:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T376905)', diff saved to https://phabricator.wikimedia.org/P70429 and previous config saved to /var/cache/conftool/dbconfig/20241021-204536-ladsgroup.json [20:45:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [20:45:47] (03PS10) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [20:45:54] I wonder how long it takes for a stat event to show up on grafana [20:45:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [20:46:00] (03PS1) 10Scott French: Revert "wmnet: add DYNA records for mw-(web|api-ext)-next" [dns] - 10https://gerrit.wikimedia.org/r/1082085 [20:46:03] minutes at most, I guess? [20:46:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T376905)', diff saved to https://phabricator.wikimedia.org/P70430 and previous config saved to /var/cache/conftool/dbconfig/20241021-204603-ladsgroup.json [20:46:21] (03PS5) 10CDanis: eqsin: haproxy: switch to array-type gpc_rate counters [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) [20:46:24] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [20:47:38] (03CR) 10Scott French: [C:03+2] Revert "wmnet: add DYNA records for mw-(web|api-ext)-next" [dns] - 10https://gerrit.wikimedia.org/r/1082085 (owner: 10Scott French) [20:48:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [20:50:14] (03CR) 10Scott French: "It looks like these never got pushed. 
Could someone with a bit more context on this run `authdns-update` to clear it? Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [20:52:03] !log tgr@deploy2002 tgr: Continuing with sync [20:52:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T376905)', diff saved to https://phabricator.wikimedia.org/P70431 and previous config saved to /var/cache/conftool/dbconfig/20241021-205211-ladsgroup.json [20:55:20] (03PS11) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [20:56:41] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082058|fix(AuthManagerStatsd): counters require static set of labels (T377476)]] (duration: 18m 43s) [20:56:57] T377476: PHP Warning: Stats: Cannot add labels to a metric containing samples for 'authmanager_error_total' - https://phabricator.wikimedia.org/T377476 [20:57:38] (03CR) 10Volkanurl: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [20:57:49] (03CR) 10Volkanurl: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [20:58:39] (03PS12) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [20:59:20] (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [20:59:57] tgr|away: it should be pretty fast from something being recorded by graphite to being visible in Grafana. Though in our case, the data comes from a maint script that runs once per hour. 
Also, somehow events from maintenance scripts do not always make it to graphite and I have no idea why 🤷 [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T2100). [21:00:16] (03CR) 10Ssingh: "Confirming hosts have been decommissioned. I am taking the liberty to merge this change to unblock authdns-update." [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [21:00:41] !log UTC late deploys done [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:51] !log running authdns-update for CR 1081371 [21:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [21:01:37] MichaelG_WMF: sorry was just talking to myself. The other change was grafana-related (and apparently did not work). [21:03:37] tgr|away: Gotcha. Though in that case it maybe already goes via prometheus? No idea what delays that introduces. 
[21:04:18] (03PS13) 10Bking: statistics::explorer hosts: refactor and improve cgroups implementation [puppet] - 10https://gerrit.wikimedia.org/r/1082072 (https://phabricator.wikimedia.org/T377734) [21:07:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P70433 and previous config saved to /var/cache/conftool/dbconfig/20241021-210718-ladsgroup.json [21:07:48] (03CR) 10Ssingh: "Please note on why I decided to merge this (thanks to Scott for bringing it up):" [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [21:07:54] (03PS1) 10SBassett: Update miscweb: security-landing-page to latest image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082089 (https://phabricator.wikimedia.org/T377168) [21:08:00] swfrench-wmf: merged it and left a note on why it was done so. all yours [21:08:12] thanks for bringing it up. we will work on that alert, clearly, it's required [21:08:18] I will take care of it tomorrow [21:09:07] (03CR) 10JHathaway: [C:03+1] "Thanks sukhe, I didn't realize `authdns-update` had not been run, sorry" [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [21:09:30] sukhe: thank you very much! and yeah, having an alert for this sounds great. I'll revert my revert and get that pushed. [21:09:40] (03CR) 10Ssingh: "No worries, we will put an alert. Thanks for the confirmation Jesse!" 
[dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [21:10:30] (03PS1) 10Scott French: Revert^2 "wmnet: add DYNA records for mw-(web|api-ext)-next" [dns] - 10https://gerrit.wikimedia.org/r/1082090 (https://phabricator.wikimedia.org/T377040) [21:12:43] (03CR) 10Scott French: [C:03+2] Revert^2 "wmnet: add DYNA records for mw-(web|api-ext)-next" [dns] - 10https://gerrit.wikimedia.org/r/1082090 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [21:13:44] (03PS1) 10Ladsgroup: tables-catalog: Remodel databases for non-default or non-core tables [puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) [21:16:33] !log ran authdns-update to pick up mw-(web|api-ext)-next discovery records - T377040 [21:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:38] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [21:16:53] (03CR) 10Ladsgroup: "@Lucas: Hi, does this way of modeling cognate tables makes sense to you?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1082091 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [21:22:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [21:22:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P70434 and previous config saved to /var/cache/conftool/dbconfig/20241021-212226-ladsgroup.json [21:24:47] (03Abandoned) 10Btullis: Datahub: disable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081948 (https://phabricator.wikimedia.org/T376657) (owner: 10Btullis) [21:25:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [21:27:01] (03CR) 10Scott French: "Thank you, sukhe!" [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [21:28:15] (03CR) 10Btullis: [C:03+1] "Looks good. I can merge this tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/1082001 (https://phabricator.wikimedia.org/T377745) (owner: 10Gmodena) [21:37:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T376905)', diff saved to https://phabricator.wikimedia.org/P70435 and previous config saved to /var/cache/conftool/dbconfig/20241021-213733-ladsgroup.json [21:37:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [21:37:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [21:38:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T376905)', diff saved to https://phabricator.wikimedia.org/P70436 and previous config saved to /var/cache/conftool/dbconfig/20241021-213801-ladsgroup.json [21:44:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T376905)', diff saved to https://phabricator.wikimedia.org/P70437 and previous config saved to /var/cache/conftool/dbconfig/20241021-214405-ladsgroup.json [21:58:06] 06SRE-OnFire, 10Incident Tooling: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#10248534 (10BCornwall) p:05Triage→03Low a:03BCornwall [21:59:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P70438 and previous config saved to /var/cache/conftool/dbconfig/20241021-215912-ladsgroup.json [21:59:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [22:00:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10248539 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [22:01:29] 
FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:10] jouncebot: nowandnext [22:04:10] For the next 0 hour(s) and 55 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241021T2100) [22:04:10] In 3 hour(s) and 55 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241022T0200) [22:09:39] (03CR) 10Zabe: [C:03+2] group0: Increase revision-slots cache expiry back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:09:47] (03CR) 10CI reject: [V:04-1] group0: Increase revision-slots cache expiry back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:09:59] (03PS2) 10Zabe: group0: Increase revision-slots cache expiry back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490) [22:10:06] (03CR) 10Zabe: [C:03+2] group0: Increase revision-slots cache expiry back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:10:49] (03Merged) 10jenkins-bot: group0: Increase revision-slots cache expiry back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [22:11:22] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1081464|group0: Increase revision-slots cache expiry back to default (T183490)]] [22:11:27] T183490: MCR schema migration stage 4: Migrate External Store URLs 
(wmf production) - https://phabricator.wikimedia.org/T183490 [22:13:38] !log zabe@deploy2002 zabe: Backport for [[gerrit:1081464|group0: Increase revision-slots cache expiry back to default (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:13:41] !log zabe@deploy2002 zabe: Continuing with sync [22:14:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P70439 and previous config saved to /var/cache/conftool/dbconfig/20241021-221419-ladsgroup.json [22:18:21] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081464|group0: Increase revision-slots cache expiry back to default (T183490)]] (duration: 06m 58s) [22:18:26] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:27:57] (03CR) 10Cwhite: [C:03+2] webperf: install php-mbstring [puppet] - 10https://gerrit.wikimedia.org/r/1081900 (https://phabricator.wikimedia.org/T377433) (owner: 10Máté Szabó) [22:29:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T376905)', diff saved to https://phabricator.wikimedia.org/P70440 and previous config saved to /var/cache/conftool/dbconfig/20241021-222926-ladsgroup.json [22:29:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [22:29:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [22:29:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70441 and previous config saved to /var/cache/conftool/dbconfig/20241021-222952-ladsgroup.json [22:35:24] (03CR) 10BCornwall: [V:03+1] varnish: Give 1% of views RSA cert warnings (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [23:17:34] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1081985 (https://phabricator.wikimedia.org/T377292) (owner: 10RLazarus) [23:18:07] (03CR) 10Scott French: [C:03+1] "Looks good! One nit and one question." [puppet] - 10https://gerrit.wikimedia.org/r/1081986 (https://phabricator.wikimedia.org/T377292) (owner: 10RLazarus) [23:20:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [23:20:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10248726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye executed... [23:30:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T376905)', diff saved to https://phabricator.wikimedia.org/P70442 and previous config saved to /var/cache/conftool/dbconfig/20241021-233018-ladsgroup.json [23:35:00] (03PS1) 10BryanDavis: modules/admin: Add bd808 to contint-roots and contint-docker groups [puppet] - 10https://gerrit.wikimedia.org/r/1082105 (https://phabricator.wikimedia.org/T377792) [23:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082106 [23:38:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082106 (owner: 10TrainBranchBot) [23:44:42] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Grant bd808 membership in the contint-roots and contint-docker groups - https://phabricator.wikimedia.org/T377792#10248765 (10thcipriani) Approved as group approver in 
`data.yaml`. Tagging for SRE clinic duty (per [... [23:45:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P70443 and previous config saved to /var/cache/conftool/dbconfig/20241021-234525-ladsgroup.json