[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0000) [00:03:13] the existing mw-videoscaler jobs aren't quite draining off as quickly as I'd like. we're hovering at about 75% utilization of shellbox-video replicas: https://grafana.wikimedia.org/goto/szRrmC4Dg?orgId=1 [00:04:37] I might go ahead and bump these to 50 replicas, which brings us up to the next multiple of 12 (6 workers per transcode job type), plus a small buffer. [00:07:11] (03PS1) 10Zabe: BETA: Add support for imagelinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225713 (https://phabricator.wikimedia.org/T413669) [00:09:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [00:10:34] let's see if this works. I may need to make some quota adjustments in order to get us to 50. [00:10:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [00:16:12] oh, great ... I just realized this is going to fail to apply, because maxUnavailable is defaulted to 25%. I could have sworn we'd overridden that to be absurdly high for exactly this scenario =/ [00:16:41] * swfrench-wmf patiently waits for it to time out, then will try again with maxUnavailable cranked up [00:18:09] (03PS1) 10HMonroy: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225716 (https://phabricator.wikimedia.org/T409613) [00:23:33] ... 
which is of course going to require not 1 but 2 10m timeouts, since the "revert" will also never pass the 25% maxUnavailable threshold [00:28:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87445 and previous config saved to /var/cache/conftool/dbconfig/20260113-002853-marostegui.json [00:28:59] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:28:59] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:29:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [00:29:44] alright, let's try that again [00:30:13] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [00:30:17] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [00:30:49] [x] step #1, set maxUnavailable absurdly high [00:31:40] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [00:31:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [00:31:53] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225709 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [00:31:53] [x] step #2, moar [00:33:43] (03PS2) 10HMonroy: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225716 (https://phabricator.wikimedia.org/T409613) [00:34:00] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225709 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [00:34:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:35:04] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:35:20] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:35:22] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:35:40] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:35:42] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:35:58] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:39:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P87446 and previous config saved to /var/cache/conftool/dbconfig/20260113-003901-marostegui.json [00:40:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1225722 [00:40:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1225722 (owner: 10TrainBranchBot) [00:43:53] swfrench-wmf: is it fine to deploy now? [00:45:27] zabe: if it's just one deployment, we're probably in decent shape, yes. if you have a couple that need to go out in short succession, then it would be preferable to hold. 
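Editorial note on the shellbox-video resize discussed earlier in this log: the arithmetic behind "50 replicas" and the 25% maxUnavailable default can be sketched roughly as below. This is only an illustration, not the chart's or helmfile's actual logic. The step of 12 and the 25% figure come from the log; the starting requirement of 40 replicas and the buffer of 2 are assumptions, picked so that the result matches the 50 mentioned above. The function names are hypothetical. Kubernetes rounds a percentage maxUnavailable down to a whole number of pods, which the second helper mirrors.

```python
def round_up_to_multiple(n: int, step: int) -> int:
    """Smallest multiple of `step` that is >= n (integer arithmetic only)."""
    return -(-n // step) * step


def target_replicas(minimum: int, step: int = 12, buffer: int = 2) -> int:
    """Round the required replica count up to a multiple of `step`, then add a buffer.

    step=12 is taken from the log; buffer=2 is an assumption (50 - 48).
    """
    return round_up_to_multiple(minimum, step) + buffer


def max_unavailable_pods(replicas: int, percent: int) -> int:
    """Absolute pod count for a percentage maxUnavailable, rounded down."""
    return replicas * percent // 100


if __name__ == "__main__":
    replicas = target_replicas(40)  # 40 is a hypothetical starting requirement
    print(replicas)                            # 50
    print(max_unavailable_pods(replicas, 25))  # 12
```

With 50 replicas and the 25% default, at most 12 pods may be unavailable during a rolling update, which is at least consistent with the stalled apply described in the log if many more pods than that were unavailable at the time. Raising maxUnavailable, as was done in step #1, lifts that ceiling.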
[00:46:08] it's just one [00:46:14] I was going to deploy 1225716 [00:46:46] jouncebot: nowandnext [00:46:46] For the next 0 hour(s) and 13 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0000) [00:46:46] In 2 hour(s) and 13 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0300) [00:47:32] zabe: TimStarling: are these two trivially safe to batch together, or is it better to separate them? [00:47:48] (mine is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1223165) [00:47:55] from my pov it should be fine [00:48:33] maybe best to do it separately, I'll wait [00:49:05] alright [00:49:09] (03CR) 10Zabe: [C:03+2] Start writing to il_target_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223165 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [00:49:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P87447 and previous config saved to /var/cache/conftool/dbconfig/20260113-004910-marostegui.json [00:50:00] (03Merged) 10jenkins-bot: Start writing to il_target_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223165 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [00:51:27] FYI, I'll be making some additional capacity tweaks on shellbox-video in the background. no concerns with those overlapping with deploys. [00:51:35] TimStarling: could you release your lock in that case? 
[00:51:50] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [00:51:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [00:53:13] sorry [00:53:27] go ahead now zabe [00:53:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1225722 (owner: 10TrainBranchBot) [00:53:34] no worries, thanks:) [00:53:39] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1223165|Start writing to il_target_id on testwiki (T413526)]] [00:53:43] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [00:55:44] !log zabe@deploy2002 zabe: Backport for [[gerrit:1223165|Start writing to il_target_id on testwiki (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:57:57] !log zabe@deploy2002 zabe: Continuing with sync [00:59:11] (03PS1) 10Scott French: shellbox-video: bump replicas due to backlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225726 [00:59:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87448 and previous config saved to /var/cache/conftool/dbconfig/20260113-005918-marostegui.json [00:59:24] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:59:24] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:59:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [00:59:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87449 and previous config saved to 
/var/cache/conftool/dbconfig/20260113-005943-marostegui.json [01:01:59] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223165|Start writing to il_target_id on testwiki (T413526)]] (duration: 08m 19s) [01:02:03] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [01:02:46] ChrisDobbins901_: if you're still around, could I ask you for a review on https://gerrit.wikimedia.org/r/1225726 to lock in the upsize to shellbox-video? [01:04:08] looking [01:05:58] (03CR) 10CDobbins: [C:03+2] shellbox-video: bump replicas due to backlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225726 (owner: 10Scott French) [01:06:30] thanks, ChrisDobbins901_! [01:06:57] np. thank you for doing the heavy lifting :) [01:07:45] (03Merged) 10jenkins-bot: shellbox-video: bump replicas due to backlog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225726 (owner: 10Scott French) [01:09:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225716 (https://phabricator.wikimedia.org/T409613) (owner: 10HMonroy) [01:09:57] (03Merged) 10jenkins-bot: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225716 (https://phabricator.wikimedia.org/T409613) (owner: 10HMonroy) [01:10:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1225732 [01:10:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1225732 (owner: 10TrainBranchBot) [01:10:45] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1225716|[metawiki] enable voting on entities with the 'Under review' status (T409613)]] [01:10:48] T409613: Support voting for wishes under review - https://phabricator.wikimedia.org/T409613 [01:11:02] !log swfrench@deploy2002 
helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [01:11:31] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [01:12:48] !log tstarling@deploy2002 tstarling, hmonroy: Backport for [[gerrit:1225716|[metawiki] enable voting on entities with the 'Under review' status (T409613)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:17:10] !log tstarling@deploy2002 tstarling, hmonroy: Continuing with sync [01:20:51] PROBLEM - Host cp5022 is DOWN: PING CRITICAL - Packet loss = 100% [01:21:10] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225716|[metawiki] enable voting on entities with the 'Under review' status (T409613)]] (duration: 10m 26s) [01:21:14] T409613: Support voting for wishes under review - https://phabricator.wikimedia.org/T409613 [01:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:29:18] (03CR) 10Zabe: [C:03+2] BETA: Add support for imagelinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225713 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:29:43] (03PS2) 10Zabe: BETA: Set imagelinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225713 (https://phabricator.wikimedia.org/T413669) [01:29:46] (03CR) 10Zabe: [C:03+2] BETA: Set imagelinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225713 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:30:36] (03Merged) 10jenkins-bot: BETA: Set imagelinks to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225713 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [01:33:19] 
(03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1225732 (owner: 10TrainBranchBot) [01:45:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [01:48:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225596 (https://phabricator.wikimedia.org/T414277) (owner: 10Seawolf35gerrit) [01:49:45] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11514946 (10KFrancis) Hi all, the NDA is complete. Thanks! [02:09:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.11 [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1225755 (https://phabricator.wikimedia.org/T413802) [02:09:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.11 [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1225755 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [02:13:03] 06SRE, 10envoy, 10ServiceOps-Services-Oids, 10ServiceOps new: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11514957 (10RLazarus) 05Open→03In progress p:05Triage→03Medium [02:21:53] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.11 [core] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1225755 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [02:26:06] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5022.eqsin.wmnet [reason: host down] [02:27:30] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 5 days, 0:00:00 on cp5022.eqsin.wmnet with reason: host down [02:36:44] 06SRE, 10MinT, 10Prod-Kubernetes, 06LPL Essential (FY2025-26 Q3), and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11514986 (10RLazarus) [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0300) [03:06:41] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100% [03:39:30] ryankemper@cumin2002 reboot-workers (PID 1421267) is awaiting input [03:59:34] (03PS3) 10Santiago Faci: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0400) [04:02:05] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225786 (https://phabricator.wikimedia.org/T413802) [04:02:08] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225786 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [04:03:03] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225786 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [04:03:32] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.11 refs T413802 [04:03:36] T413802: 1.46.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T413802 [04:18:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of 
the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:29:51] PROBLEM - Host hcaptcha-proxy7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.103) [04:29:59] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [04:29:59] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [04:30:11] PROBLEM - Host install7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.100) [04:30:13] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [04:30:13] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [04:30:17] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 142.03 ms [04:30:17] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 141.74 ms [04:30:21] RECOVERY - Host hcaptcha-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 138.10 ms [04:30:21] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.14 ms [04:30:21] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.15 ms [04:30:37] RECOVERY - Host install7002 is UP: PING OK - Packet loss = 0%, RTA = 138.03 ms [04:33:15] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:34:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:47:49] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.11 refs T413802 (duration: 44m 17s) [04:47:53] T413802: 1.46.0-wmf.11 deployment blockers - 
https://phabricator.wikimedia.org/T413802 [04:57:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0500) [05:04:13] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.5 (duration: 04m 11s) [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:11] (03PS1) 10KartikMistry: Update cxserver to 2026-01-09-231405-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225807 (https://phabricator.wikimedia.org/T414237) [05:18:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:22:35] (03PS1) 10Giuseppe Lavagetto: cache::text: raise the global_auth_ratelimit due to user reports [puppet] - 10https://gerrit.wikimedia.org/r/1225812 [05:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:07] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:17] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1225538 (owner: 10L10n-bot) [05:55:00] (03PS2) 10Giuseppe Lavagetto: cache::text: raise the global_auth_ratelimit due to user reports [puppet] - 10https://gerrit.wikimedia.org/r/1225812 [05:58:51] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] cache::text: raise the global_auth_ratelimit due to user reports [puppet] - 10https://gerrit.wikimedia.org/r/1225812 (owner: 10Giuseppe Lavagetto) [06:13:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance [06:13:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:18:41] (03CR) 10Marostegui: [C:03+1] Remove profile::puppet::agent::force_puppet7 from DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225576 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [06:20:02] FIRING: SystemdUnitFailed: netbox_ganeti_magru03_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:21:57] (03PS1) 10Shivaansh Singh: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 
(https://phabricator.wikimedia.org/T414403) [06:22:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:28:02] (03PS1) 10Marostegui: db2249: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1226025 [06:29:08] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1226025 (owner: 10Marostegui) [06:29:10] (03CR) 10Marostegui: [C:03+2] db2249: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1226025 (owner: 10Marostegui) [06:34:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru03_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:38] !log push pfw policies - T414393 [06:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2142: Schema change [06:48:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:48:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:48:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2142: Schema change [06:50:04] !log Deploy schema change on ms1 T411497 [06:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:08] T411497: Drop modtoken and flags columns from cache tables - https://phabricator.wikimedia.org/T411497 [06:50:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2142: After Schema change [06:50:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:50:58] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.mysql.parsercache (exit_code=0) [06:50:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2142: After Schema change [06:52:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:55:08] (03PS3) 10Giuseppe Lavagetto: cache:haproxy: Add Lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) [06:55:55] (03CR) 10Giuseppe Lavagetto: cache:haproxy: Add Lua-based contact info extraction (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [06:57:12] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7867/co" [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0700) [07:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0700). 
[07:05:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2144: Schema change [07:05:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:05:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:05:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2144: Schema change [07:06:04] !log Deploy schema change on ms2 T411497 [07:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:08] T411497: Drop modtoken and flags columns from cache tables - https://phabricator.wikimedia.org/T411497 [07:06:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2144: After Schema change [07:06:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:06:33] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7868/co" [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [07:06:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:06:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2144: After Schema change [07:07:09] (03CR) 10Marostegui: "I've been using these cookbooks for msX and they look good to me. I think once we've added the x1 support, we can test and merge." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [07:10:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2143: Schema change [07:10:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:10:31] !log Deploy schema change on ms3 T411497 [07:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:10:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2143: Schema change [07:11:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2143: After Schema change [07:11:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:11:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:11:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2143: After Schema change [07:18:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:11] (03CR) 10Filippo Giunchedi: "LGTM, I'll let o11y folks vote" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [07:21:48] 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11515258 (10fgiunchedi) Adding back ops-eqiad for visibility [07:27:11] (03PS1) 10Filippo Giunchedi: pontoon: reduce postgres wal segments [puppet] - 10https://gerrit.wikimedia.org/r/1226037 [07:34:54] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host an-test-druid1001.eqiad.wmnet with 
OS bookworm [07:34:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226037 (owner: 10Filippo Giunchedi) [07:46:04] (03CR) 10Brouberol: "re HDFS vs helm: yes, I think it's better to go the helm way. The reason why that is is that _all_ our Kubernetes-related secrets are defi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [07:47:02] (03PS1) 10Muehlenhoff: Rename perf-team access group and assign approver [puppet] - 10https://gerrit.wikimedia.org/r/1226039 (https://phabricator.wikimedia.org/T276465) [07:48:10] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-druid1001.eqiad.wmnet with reason: host reimage [07:48:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS trixie [07:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:53:49] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-druid1001.eqiad.wmnet with reason: host reimage [07:58:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:59:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226039 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [08:00:04] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning 
backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:04:52] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [08:08:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [08:15:22] (03CR) 10Krinkle: [C:03+1] Rename perf-team access group and assign approver [puppet] - 10https://gerrit.wikimedia.org/r/1226039 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [08:24:42] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: reduce postgres wal segments [puppet] - 10https://gerrit.wikimedia.org/r/1226037 (owner: 10Filippo Giunchedi) [08:27:57] FIRING: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:28:57] !incidents [08:28:58] 7327 (UNACKED) ProbeDown sre (10.2.1.20 ip4 mathoid:4001 probes/service http_mathoid_ip4 codfw) [08:28:58] 7326 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [08:28:58] 7325 (RESOLVED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [08:29:09] !ach 7327 [08:29:11] FIRING: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:29:23] !ack 7327 [08:29:24] 7327 (ACKED) ProbeDown sre (10.2.1.20 ip4 mathoid:4001 probes/service http_mathoid_ip4 codfw) [08:29:51] FIRING: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:31:05] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-druid1001.eqiad.wmnet with OS bookworm [08:31:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2005.codfw.wmnet with OS trixie [08:33:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:33:28] !incidents [08:33:29] 7327 (ACKED) ProbeDown sre (10.2.1.20 ip4 mathoid:4001 probes/service http_mathoid_ip4 codfw) [08:33:29] 7328 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [08:33:29] 7326 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [08:33:29] 7325 (RESOLVED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [08:33:42] !ack 7328 [08:33:42] 7328 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [08:34:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:36:45] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/1225525 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:39:51] FIRING: [2x] 
ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:42:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2160,2232].codfw.wmnet with reason: testing [08:43:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:43:31] !log Stop mariadb on db2232 (m5) for testing dbproxy2005 T409398 [08:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:35] T409398: Test Debian Trixie for dbproxy role - https://phabricator.wikimedia.org/T409398 [08:48:58] (03CR) 10JMeybohm: [C:03+1] "LGTM. Please make sure to first deploy and test this on a depooled registry node." [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [08:49:48] (03CR) 10Dpogorzelski: [C:03+2] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [08:52:09] (03PS3) 10DCausse: airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) [08:52:13] (03CR) 10DCausse: "thanks again, if/when you have a moment I'd love some help to get this deployed (esp. 
for the private repo, I don't think I can make commi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [08:52:57] RESOLVED: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:54:57] FIRING: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:08] !incidents [08:56:08] 7329 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet) [08:56:08] 7330 (ACKED) ProbeDown sre (10.2.1.20 ip4 mathoid:4001 probes/service http_mathoid_ip4 codfw) [08:56:09] 7327 (RESOLVED) ProbeDown sre (10.2.1.20 ip4 mathoid:4001 probes/service http_mathoid_ip4 codfw) [08:56:09] 7328 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [08:56:09] 7326 (RESOLVED) ProbeDown sre (10.2.1.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 codfw) [08:56:09] 7325 (RESOLVED) Manual (paged) by RLazarus (rlazarus@wikimedia.org): vopsbot test page, please ignore [08:59:11] RESOLVED: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:59:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://mathoid.svc.codfw.wmnet:4001 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:59:57] RESOLVED: ProbeDown: Service mathoid:4001 has failed probes (http_mathoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mathoid:4001 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:02:39] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [09:02:39] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [09:03:13] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 138.09 ms [09:03:13] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 138.01 ms [09:03:21] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] cache:haproxy: Add Lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1225536 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [09:05:17] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: connect to address 10.64.0.151 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:05:35] PROBLEM - Docker registry HTTPS interface certificate expiry on registry1005 is CRITICAL: connect to address 10.64.0.151 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:08:57] PROBLEM - Docker registry HTTPS interface certificate expiry on registry2004 is CRITICAL: connect to address 10.192.16.49 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:09:10] FIRING: BFDdown: BFD session down between cr3-ulsfo and 
fe80::ee38:73ff:fee7:bc66 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:09:17] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: connect to address 10.192.16.49 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:12:40] dpogorzelski: is that from you by chance? ^ [09:14:10] RESOLVED: BFDdown: BFD session down between cr3-ulsfo and fe80::ee38:73ff:fee7:bc66 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:14:33] 10ops-eqsin, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411 (10Vgutierrez) 03NEW [09:14:49] 10ops-eqsin, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11515417 (10Vgutierrez) p:05Triage→03Medium [09:16:11] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - docker-registry_443: Servers registry1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:16:17] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: connect to address 10.64.32.143 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:16:35] PROBLEM - Docker registry HTTPS interface certificate expiry on registry1004 is CRITICAL: connect to address 10.64.32.143 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:16:57] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - docker-registry_443: Servers registry1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[09:16:57] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:03] dpogorzelski: I'm pretty sure it's your change that breaks the registry. I will roll back [09:18:20] (03PS1) 10JMeybohm: Revert "docker registry: add ml build user password" [puppet] - 10https://gerrit.wikimedia.org/r/1226165 [09:19:41] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - docker-registry_443: Servers registry2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:19:41] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - docker-registry_443: Servers registry2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:19:53] (03CR) 10JMeybohm: [V:03+2 C:03+2] Revert "docker registry: add ml build user password" [puppet] - 10https://gerrit.wikimedia.org/r/1226165 (owner: 10JMeybohm) [09:19:57] PROBLEM - Docker registry HTTPS interface certificate expiry on registry2005 is CRITICAL: connect to address 10.192.16.7 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:20:17] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: connect to address 10.192.16.7 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [09:21:57] FIRING: [2x] ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:41] RECOVERY - PyBal backends health check on lvs2014 is OK: 
PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:22:57] RECOVERY - Docker registry HTTPS interface certificate expiry on registry2004 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Sun 25 Jan 2026 08:02:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [09:23:17] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Docker [09:23:17] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Docker [09:23:17] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Docker [09:23:20] thanks jayme [09:23:35] RECOVERY - Docker registry HTTPS interface certificate expiry on registry1005 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 09 Feb 2026 12:27:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [09:23:35] RECOVERY - Docker registry HTTPS interface certificate expiry on registry1004 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Sun 25 Jan 2026 06:39:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [09:23:41] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:57] RECOVERY - Docker registry HTTPS interface certificate expiry on registry2005 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 02 Feb 2026 11:43:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Docker [09:24:11] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:24:17] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Docker [09:25:07] RESOLVED: [2x] ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:25:57] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:26:57] RESOLVED: [2x] ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:11] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:29:52] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from DB hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1225576 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:31:38] (03PS1) 10Giuseppe Lavagetto: cache::text: rollout lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1226172 (https://phabricator.wikimedia.org/T414300) [09:31:40] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: cleanup parameter lua_contact_info after rollout [puppet] - 10https://gerrit.wikimedia.org/r/1226173 (https://phabricator.wikimedia.org/T414300) [09:34:13] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:34:55] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413 (10LSobanski) 03NEW [09:37:03] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from collab roles [puppet] - 10https://gerrit.wikimedia.org/r/1226175 (https://phabricator.wikimedia.org/T365798) [09:37:27] (03CR) 10Fabfur: [C:03+1] cache::haproxy: cleanup parameter lua_contact_info after rollout [puppet] - 10https://gerrit.wikimedia.org/r/1226173 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [09:37:48] (03CR) 10Fabfur: [C:03+1] cache::text: rollout lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1226172 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [09:39:44] (03CR) 10Vgutierrez: [C:03+1] cache::text: rollout lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1226172 (https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [09:39:50] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1226176 (https://phabricator.wikimedia.org/T365798) [09:39:56] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::text: rollout lua-based contact info extraction [puppet] - 10https://gerrit.wikimedia.org/r/1226172 
(https://phabricator.wikimedia.org/T414300) (owner: 10Giuseppe Lavagetto) [09:40:01] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:41:20] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [09:42:47] (03PS1) 10Brouberol: Build for Bookworm [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 [09:45:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87457 and previous config saved to /var/cache/conftool/dbconfig/20260113-094502-marostegui.json [09:45:08] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:45:08] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:48:06] (03PS2) 10Brouberol: Build for Bookworm [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 [09:48:07] (03CR) 10Btullis: [C:03+1] Build for Bookworm [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 (owner: 10Brouberol) [09:48:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 (owner: 10Brouberol) [09:49:46] (03CR) 10Brouberol: [C:03+2] Build for Bookworm [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 (owner: 10Brouberol) [09:49:49] (03CR) 10Brouberol: [V:03+2 C:03+2] Build for Bookworm [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/1226177 (owner: 10Brouberol) [09:50:55] (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1226175 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:52:49] (03CR) 10Blake: [C:03+2] switchdc: Delete services cookbook. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1225500 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [09:55:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P87458 and previous config saved to /var/cache/conftool/dbconfig/20260113-095510-marostegui.json [09:55:31] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11515509 (10Fabfur) @Xqt we're rolling out a change that should lift the current ratelimiting and impact Pywikibot too, could you please check in ~30 minutes if yo... [09:57:49] (03Merged) 10jenkins-bot: switchdc: Delete services cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/1225500 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [09:57:52] (03PS1) 10Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from observability roles [puppet] - 10https://gerrit.wikimedia.org/r/1226178 (https://phabricator.wikimedia.org/T365798) [10:05:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P87459 and previous config saved to /var/cache/conftool/dbconfig/20260113-100519-marostegui.json [10:06:50] !log revoked legacy similar-users discovery certificate T365798 [10:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:53] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [10:14:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:15:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T411163 T411164)', diff saved to 
https://phabricator.wikimedia.org/P87460 and previous config saved to /var/cache/conftool/dbconfig/20260113-101528-marostegui.json [10:15:34] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:15:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:15:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2246.codfw.wmnet with reason: Maintenance [10:15:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87461 and previous config saved to /var/cache/conftool/dbconfig/20260113-101552-marostegui.json [10:19:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:21:34] (03PS1) 10Dzahn: admin: document LDAP access for Kim Pham of WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1226189 (https://phabricator.wikimedia.org/T414157) [10:22:31] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2007.codfw.wmnet with OS bookworm [10:24:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1226189 (https://phabricator.wikimedia.org/T414157) (owner: 10Dzahn) [10:24:19] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [10:24:32] (03CR) 10Dzahn: [C:03+2] admin: document LDAP access for Kim Pham of WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1226189 (https://phabricator.wikimedia.org/T414157) (owner: 10Dzahn) [10:24:52] (03PS1) 10Vgutierrez: traffic: ignore MSS values of 0 on LVSRealserverMSS [alerts] - 10https://gerrit.wikimedia.org/r/1226190 (https://phabricator.wikimedia.org/T400155) [10:28:12] 
(03PS1) 10JMeybohm: aptrepo: Add k8s 1.31 related components to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1226191 (https://phabricator.wikimedia.org/T414417) [10:28:18] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11515627 (10Joe) Hi, I still see a lot of requests from your IPs with user-agent `Faraday v2.14.0`. These are calls to `//w/api.php`, `/w/api.php`, `/w/index.php` in... [10:29:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1226191 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [10:29:39] (03CR) 10JMeybohm: [C:03+2] aptrepo: Add k8s 1.31 related components to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1226191 (https://phabricator.wikimedia.org/T414417) (owner: 10JMeybohm) [10:30:00] (03CR) 10Fabfur: [C:03+1] traffic: ignore MSS values of 0 on LVSRealserverMSS [alerts] - 10https://gerrit.wikimedia.org/r/1226190 (https://phabricator.wikimedia.org/T400155) (owner: 10Vgutierrez) [10:30:55] !log LDAP - add kimpham to groups wmde and nda (T414157) [10:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:59] T414157: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157 [10:32:32] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11515640 (10Dzahn) 05Open→03Resolved a:03Dzahn Thanks Katie! Hi @kimpham you have now been added to the requested groups "nda" and "wmde", like other WMDE staff and with the different pr... 
[10:36:36] (03PS1) 10Dzahn: admin: document LDAP access for Martyn Ranyard of WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1226192 (https://phabricator.wikimedia.org/T413994) [10:40:24] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11515678 (10Dzahn) [10:40:34] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11515679 (10Dzahn) [10:41:37] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226195 [10:43:30] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11515687 (10Dzahn) updating ticket because requested group analytics-privatedata-users isn't an LDAP group. Still to be determined which level 1, 2 or 3 is requested (see https://... [10:51:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87462 and previous config saved to /var/cache/conftool/dbconfig/20260113-105110-marostegui.json [10:51:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:51:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:54:39] (03CR) 10Elukey: [C:03+1] Remove profile::puppet::agent::force_puppet7 from ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1226176 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:56:14] (03CR) 10Vgutierrez: [C:03+2] traffic: ignore MSS values of 0 on LVSRealserverMSS [alerts] - 10https://gerrit.wikimedia.org/r/1226190 (https://phabricator.wikimedia.org/T400155) (owner: 10Vgutierrez) [10:57:08] (03CR) 10Elukey: [C:03+2] profile::docker_registry: turn off backend redirects for Swift 
[puppet] - 10https://gerrit.wikimedia.org/r/1225526 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [10:58:51] (03PS1) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1100) [11:01:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P87463 and previous config saved to /var/cache/conftool/dbconfig/20260113-110119-marostegui.json [11:03:01] (03PS2) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) [11:03:25] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2007.codfw.wmnet with OS bookworm [11:03:27] !log disable HTTP redirects to the Swift backend for all the Docker registries - T390251 [11:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:31] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [11:03:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [11:03:40] (03PS3) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) [11:05:01] (03CR) 10Dpogorzelski: "Re opening my docker registry change, this time rebased on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224091/3/modules/d" [puppet] - 10https://gerrit.wikimedia.org/r/1226204 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [11:08:10] 06SRE, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) 
processing - https://phabricator.wikimedia.org/T414427 (10TheDJ) 03NEW [11:09:19] 06SRE, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11515861 (10TheDJ) [11:09:22] (03PS1) 10Clément Goubert: ratelimit: Update ratelimit service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226208 (https://phabricator.wikimedia.org/T414002) [11:09:24] (03PS1) 10Clément Goubert: api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) [11:11:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P87464 and previous config saved to /var/cache/conftool/dbconfig/20260113-111127-marostegui.json [11:11:48] (03PS1) 10Dzahn: admin: fix uid for Kim Pham [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) [11:12:07] (03CR) 10CI reject: [V:04-1] admin: fix uid for Kim Pham [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) (owner: 10Dzahn) [11:13:41] (03CR) 10Arnaudb: [C:03+1] admin: document LDAP access for Martyn Ranyard of WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1226192 (https://phabricator.wikimedia.org/T413994) (owner: 10Dzahn) [11:14:15] (03CR) 10Hnowlan: [C:03+1] api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [11:14:23] (03CR) 10Dzahn: [C:03+2] admin: document LDAP access for Martyn Ranyard of WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1226192 (https://phabricator.wikimedia.org/T413994) (owner: 10Dzahn) [11:15:48] (03PS2) 10Dzahn: admin: fix uid for Kim Pham [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) [11:16:41] (03CR) 
10Dzahn: [C:03+1] "[ldap-maint1001:~] $ ldapsearch -x mail=kim.pham*" [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) (owner: 10Dzahn) [11:17:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, context is https://idm.wikimedia.org/wikimedia/log/" [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) (owner: 10Dzahn) [11:17:39] (03CR) 10Dzahn: [C:03+1] "a search for mail=kim.pham* matches uid pham - uid kimpham has no mail" [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) (owner: 10Dzahn) [11:17:47] (03CR) 10Dzahn: [C:03+2] admin: fix uid for Kim Pham [puppet] - 10https://gerrit.wikimedia.org/r/1226210 (https://phabricator.wikimedia.org/T414418) (owner: 10Dzahn) [11:18:11] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [11:18:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:21:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87465 and previous config saved to /var/cache/conftool/dbconfig/20260113-112134-marostegui.json [11:21:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:21:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:21:52] !log marostegui@cumin1003 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [11:22:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87466 and previous config saved to /var/cache/conftool/dbconfig/20260113-112159-marostegui.json [11:22:03] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm [11:23:32] !log LDAP - add martynranyard to groups wmde and nda (T413994) [11:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:35] T413994: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994 [11:23:58] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11515962 (10Dzahn) 05Open→03Resolved a:03Dzahn Thanks Katie! Hi @Martyn.ranyard you have now been added to the requested groups "nda" and "wmde", like other WMDE s... [11:24:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [11:25:53] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11515977 (10Dzahn) 05Open→03Stalled p:05Triage→03Medium [11:26:55] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11515989 (10Dzahn) a:03thcipriani [11:28:03] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11515995 (10Dzahn) a:05DSantamaria→03None [11:35:04] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11516063 (10elukey) >>! 
In T250367#11511124, @ayounsi wrote: >> Is sretest2003 the only one that shows this behavior, or do we have others? I am particularly i... [11:38:48] (03CR) 10Btullis: [C:03+2] Add three new dse-k8s-workers in eqiad to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1225601 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [11:40:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm [11:41:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11516113 (10BTullis) >>! In T414216#11513961, @Jclark-ctr wrote: > @BTullis I see these servers in preseed, but when I check site.pp, they have... [11:42:59] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2008.wikimedia.org with reason: host reimage [11:44:04] 06SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223#11516135 (10Dzahn) @MLechvien-WMF Done! Please be aware this is the same as root@ and the mail volume is .. high. If you change your mind after a while just let us know and we can remove it again. [11:48:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2008.wikimedia.org with reason: host reimage [11:49:02] !log LDAP - fixed group membership in wmde and nda, kimpham -> pham [11:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:44] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11516190 (10elukey) To confirm how the Docker distribution sees the storage: ` elukey@registry2004:~$ curl localhost:5002/v2/_catalog {"repositories":["echoserver"... 
[12:00:36] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226218 [12:03:00] (03PS2) 10Ayounsi: Release v0.11.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1225671 [12:04:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2008.wikimedia.org with OS bookworm [12:06:55] (03PS1) 10Btullis: Update the Java version and other settings for the druid test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1226219 (https://phabricator.wikimedia.org/T278056) [12:07:57] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [12:07:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1226219 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [12:08:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1335.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:08:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1336.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:08:53] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1339.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:08:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1338.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:09:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1340.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:09:10] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1337.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART 
[12:09:31] (03CR) 10Hnowlan: [C:03+1] Rename perf-team access group and assign approver [puppet] - 10https://gerrit.wikimedia.org/r/1226039 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [12:10:56] (03CR) 10Hnowlan: [C:03+2] thumbor: limit SVGs based on original file format, not output [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1212191 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [12:12:01] PROBLEM - Host an-conf1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:41] (03CR) 10Muehlenhoff: [C:03+2] Rename perf-team access group and assign approver [puppet] - 10https://gerrit.wikimedia.org/r/1226039 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [12:15:50] 06SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223#11516357 (10Dzahn) If you see the new emails and all seems ok feel free to close it as resolved. [12:16:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1340.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1336.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1338.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1335.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1337.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:59] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q3): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11516364 
(10hnowlan) [12:17:01] 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11516366 (10hnowlan) [12:17:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11516368 (10hnowlan) [12:19:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1339.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:47] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1341.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:51] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1343.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:53] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1345.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1344.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:57] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1342.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:20:22] (03PS1) 10Btullis: Fail back the hive services to an-coord1003 [dns] - 10https://gerrit.wikimedia.org/r/1226221 (https://phabricator.wikimedia.org/T303168) [12:21:12] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1346.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:22:00] (03Merged) 10jenkins-bot: thumbor: limit SVGs based on original file format, not output [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1212191 (https://phabricator.wikimedia.org/T411076) (owner: 10Hnowlan) [12:27:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1341.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [12:27:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1342.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:28:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1345.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:29:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1343.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:30:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1344.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:31:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1346.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:32:29] (03PS1) 10Filippo Giunchedi: kubernetes: conditional for rsyslog-k8s component [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) [12:32:31] jclark@cumin1003 provision (PID 1242847) is awaiting input [12:33:29] jclark@cumin1003 provision (PID 1242833) is awaiting input [12:34:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1349.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:34:42] (03CR) 10Filippo Giunchedi: "Supporting data" [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:37:14] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1350.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:37:26] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1351.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:37:28] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1352.mgmt.eqiad.wmnet with chassis 
set policy FORCE_RESTART [12:37:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1353.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:37:32] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1354.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:37:44] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1352.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:38:15] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1226176 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:38:18] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1352.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:41:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1349.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:42:05] (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1226219 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [12:42:09] (03CR) 10Brouberol: [C:03+1] Update the Java version and other settings for the druid test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1226219 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [12:42:16] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1355.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:42:23] (03CR) 10Joal: [C:03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/1226221 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:43:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1350.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:43:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1356.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:44:00] (03CR) 10JMeybohm: [C:03+1] kubernetes: conditional for rsyslog-k8s component [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:46:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1353.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:46:28] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11516468 (10MoritzMuehlenhoff) [12:46:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1357.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:46:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1352.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:47:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1358.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:47:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1351.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:47:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1354.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:48:30] (03CR) 10Brouberol: [C:03+1] Fail back the hive services to an-coord1003 [dns] - 10https://gerrit.wikimedia.org/r/1226221 
(https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:49:01] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1359.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:49:08] (03CR) 10Btullis: [C:03+2] Fail back the hive services to an-coord1003 [dns] - 10https://gerrit.wikimedia.org/r/1226221 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [12:49:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:49:24] !log btullis@dns1004 START - running authdns-update [12:49:24] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1335.eqiad.wmnet with OS trixie [12:49:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516480 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1335.eqiad.wmnet with OS trixie [12:50:28] !log btullis@dns1004 END - running authdns-update [12:50:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1356.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:51:43] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1336.eqiad.wmnet with OS trixie [12:51:44] (03CR) 10Muehlenhoff: [C:03+1] "You could also simply remove it; we don't use any k8s on Bullseye anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:51:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1336.eqiad.wmnet with OS trixie [12:52:12] !log 
jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1355.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:52:43] (03CR) 10Muehlenhoff: kubernetes: conditional for rsyslog-k8s component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:53:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1337.eqiad.wmnet with OS trixie [12:53:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1337.eqiad.wmnet with OS trixie [12:53:21] (03CR) 10Btullis: [V:03+1 C:03+2] Update the Java version and other settings for the druid test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1226219 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [12:53:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1357.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:53:45] 06SRE, 06Data-Platform-SRE, 06Data-Engineering (Q3 FY25/26 January 1st - March 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11516485 (10JMeybohm) [12:54:24] (03PS1) 10Majavah: P:wmcs: maintain_dbusers: Fix checking for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/1226231 (https://phabricator.wikimedia.org/T414452) [12:54:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1358.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:54:31] (03PS1) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 
(https://phabricator.wikimedia.org/T412803) [12:54:42] 06SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223#11516487 (10JMeybohm) a:03MLechvien-WMF [12:55:22] (03CR) 10CI reject: [V:04-1] Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [12:55:34] (03CR) 10FNegri: [C:03+1] P:wmcs: maintain_dbusers: Fix checking for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/1226231 (https://phabricator.wikimedia.org/T414452) (owner: 10Majavah) [12:55:48] (03PS2) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) [12:56:00] (03PS2) 10Filippo Giunchedi: kubernetes: use Debian rsyslog-kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) [12:56:08] (03CR) 10Filippo Giunchedi: "Good point and even simpler, done" [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [12:56:15] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1340.eqiad.wmnet with OS trixie [12:56:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1339.eqiad.wmnet with OS trixie [12:56:19] (03CR) 10Majavah: [C:03+2] P:wmcs: maintain_dbusers: Fix checking for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/1226231 (https://phabricator.wikimedia.org/T414452) (owner: 10Majavah) [12:56:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516503 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1340.eqiad.wmnet with OS trixie 
[12:56:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1339.eqiad.wmnet with OS trixie [12:56:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1359.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:56:35] (03CR) 10CI reject: [V:04-1] Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [12:57:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1341.eqiad.wmnet with OS trixie [12:57:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1341.eqiad.wmnet with OS trixie [12:57:27] (03PS3) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) [12:59:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516522 (10Jclark-ctr) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1300) [13:00:20] (03CR) 10Muehlenhoff: kubernetes: use Debian rsyslog-kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [13:00:49] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1335.eqiad.wmnet with reason: 
host reimage [13:03:20] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1336.eqiad.wmnet with reason: host reimage [13:04:17] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1337.eqiad.wmnet with reason: host reimage [13:04:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1335.eqiad.wmnet with reason: host reimage [13:05:12] !log installing gnupg2 security updates [13:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:31] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1339.eqiad.wmnet with reason: host reimage [13:07:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1336.eqiad.wmnet with reason: host reimage [13:07:56] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1340.eqiad.wmnet with reason: host reimage [13:08:10] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1341.eqiad.wmnet with reason: host reimage [13:11:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1339.eqiad.wmnet with reason: host reimage [13:14:10] (03PS1) 10Muehlenhoff: ganeti/magru: Switch to dnsmasq as the DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1226239 (https://phabricator.wikimedia.org/T396864) [13:14:13] (03PS1) 10Muehlenhoff: ganeti/esams: Switch to dnsmasq as the DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1226240 (https://phabricator.wikimedia.org/T396864) [13:17:38] (03PS1) 10Majavah: P:wmcs: maintain_dbusers: Filter disabled users earlier [puppet] - 10https://gerrit.wikimedia.org/r/1226243 [13:17:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1340.eqiad.wmnet with reason: host
reimage [13:18:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:20:45] (03PS3) 10Filippo Giunchedi: kubernetes: use Debian rsyslog-kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) [13:20:47] (03CR) 10Filippo Giunchedi: kubernetes: use Debian rsyslog-kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [13:21:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1341.eqiad.wmnet with reason: host reimage [13:22:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [13:22:30] !log installing squid security updates [13:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:07] (03CR) 10Filippo Giunchedi: [C:03+2] kubernetes: use Debian rsyslog-kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1226224 (https://phabricator.wikimedia.org/T414417) (owner: 10Filippo Giunchedi) [13:23:31] jclark@cumin1003 reimage (PID 1244016) is awaiting input [13:23:38] (03CR) 10FNegri: [C:03+1] P:wmcs: maintain_dbusers: Filter disabled users earlier [puppet] - 10https://gerrit.wikimedia.org/r/1226243 (owner: 10Majavah) [13:23:52] (03CR) 10Majavah: [C:03+2] P:wmcs: maintain_dbusers: Filter disabled users earlier [puppet] - 10https://gerrit.wikimedia.org/r/1226243 (owner: 10Majavah) [13:24:11] FIRING: KubernetesCalicoDown: 
ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:25:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:25:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1335.eqiad.wmnet with OS trixie [13:25:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1335.eqiad.wmnet with OS trixie completed: - wikikube... [13:25:49] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:26:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:26:07] 06SRE, 10TimedMediaHandler-Transcode: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11516565 (10TheDJ) [13:26:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1336.eqiad.wmnet with OS trixie [13:26:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1336.eqiad.wmnet with OS trixie completed: - 
wikikube... [13:26:47] (03CR) 10Elukey: [C:03+1] Release v0.11.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1225671 (owner: 10Ayounsi) [13:27:00] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226218 (owner: 10Muehlenhoff) [13:27:57] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:28:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:28:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1339.eqiad.wmnet with OS trixie [13:28:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1339.eqiad.wmnet with OS trixie completed: - wikikube... [13:28:30] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226239 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [13:28:58] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11516572 (10Dzahn) a:03Dzahn @ATitkov @cmadeo Hi! It seems to me all is done here and ready to go live on Thursday. We have tested and the... 
[13:29:02] (03CR) 10Ayounsi: [C:03+2] Release v0.11.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1225671 (owner: 10Ayounsi) [13:29:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1337.eqiad.wmnet with reason: host reimage [13:31:04] (03CR) 10Dzahn: [C:03+2] "that fixed it:)) thanks, Jelto" [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [13:32:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1342.eqiad.wmnet with OS trixie [13:32:18] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::puppet::agent::force_puppet7 from collab roles [puppet] - 10https://gerrit.wikimedia.org/r/1226175 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:32:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1342.eqiad.wmnet with OS trixie [13:32:32] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1343.eqiad.wmnet with OS trixie [13:32:34] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1344.eqiad.wmnet with OS trixie [13:32:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516583 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1343.eqiad.wmnet with OS trixie [13:32:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516584 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1344.eqiad.wmnet with OS trixie [13:33:52] (03CR) 
10Ayounsi: [C:03+2] ganeti/magru: Switch to dnsmasq as the DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1226239 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [13:34:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516603 (10Jclark-ctr) [13:34:23] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:34:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:34:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1340.eqiad.wmnet with OS trixie [13:34:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1340.eqiad.wmnet with OS trixie completed: - wikikube... 
[13:35:08] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1345.eqiad.wmnet with OS trixie [13:35:16] !log ayounsi@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.11.1 - ayounsi@cumin1003 [13:35:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1345.eqiad.wmnet with OS trixie [13:35:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [13:36:05] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.11.1 - ayounsi@cumin1003 [13:37:00] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:37:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:37:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1341.eqiad.wmnet with OS trixie [13:37:47] (03PS2) 10Blake: datacenter: remove unused EXCLUDED_SERVICES constant. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) [13:37:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1341.eqiad.wmnet with OS trixie completed: - wikikube... [13:38:14] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1346.eqiad.wmnet with OS trixie [13:38:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516629 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1346.eqiad.wmnet with OS trixie [13:39:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516632 (10Jclark-ctr) [13:42:26] !log ayounsi@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin1003.eqiad.wmnet with reason: Release v0.11.1 - ayounsi@cumin1003 [13:43:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1003.eqiad.wmnet with reason: Release v0.11.1 - ayounsi@cumin1003 [13:43:28] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1342.eqiad.wmnet with reason: host reimage [13:43:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1343.eqiad.wmnet with reason: host reimage [13:43:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1344.eqiad.wmnet with reason: host reimage [13:46:04] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:46:17] !log jclark@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker1345.eqiad.wmnet with reason: host reimage [13:46:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:46:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1337.eqiad.wmnet with OS trixie [13:47:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516653 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1337.eqiad.wmnet with OS trixie completed: - wikikube... [13:49:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1342.eqiad.wmnet with reason: host reimage [13:49:30] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1346.eqiad.wmnet with reason: host reimage [13:51:41] (03PS1) 10Btullis: Only install the JRE instead of the JDK on druid-test hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) [13:51:51] (03PS2) 10Btullis: Only install the JRE instead of the JDK on druid-test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) [13:52:23] (03PS13) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [13:52:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1343.eqiad.wmnet with reason: host reimage [13:52:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7871/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:53:36] (03CR) 10Federico Ceratto: "I added support for the `x*` sections and tests and did some minor cleanup." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:54:55] (03CR) 10Dzahn: [C:03+1] eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [13:55:04] (03CR) 10Btullis: Only install the JRE instead of the JDK on druid-test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:55:17] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and not P{cp7008.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [13:55:20] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [13:55:44] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and not P{cp7016.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [13:56:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1344.eqiad.wmnet with reason: host reimage [13:56:50] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:57:27] (03CR) 10CI reject: [V:04-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:57:29] (03PS1) 10Muehlenhoff: dnsmasq: Require package to be installed [puppet] - 10https://gerrit.wikimedia.org/r/1226254 [13:58:50] (03CR) 10Brouberol: [C:03+1] Only install the JRE instead of the JDK on druid-test hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [13:59:25] (03PS3) 10Muehlenhoff: nftables: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219877 [13:59:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11516691 (10Jclark-ctr) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1400). 
[14:00:05] nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt mwlog1003 - jclark@cumin1003" [14:00:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt mwlog1003 - jclark@cumin1003" [14:00:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1345.eqiad.wmnet with reason: host reimage [14:01:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:02:39] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11516698 (10JMeybohm) Your account is already a member of the group (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m... 
[14:03:10] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11516699 (10JMeybohm) [14:03:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1346.eqiad.wmnet with reason: host reimage [14:05:14] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:05:21] 👋 [14:05:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:05:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1342.eqiad.wmnet with OS trixie [14:05:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1342.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:06:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516715 (10Jclark-ctr) [14:06:21] I can deploy [14:06:24] (03CR) 10Zabe: [C:03+2] ProofreadPage: Disable flag to render using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [14:06:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:16] (03Merged) 10jenkins-bot: ProofreadPage: Disable flag to render using parsoid temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225613 (https://phabricator.wikimedia.org/T408915) (owner: 10Jgiannelos) [14:07:58] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:08:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:08:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1343.eqiad.wmnet with OS trixie [14:08:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1343.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:08:23] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225613|ProofreadPage: Disable flag to render using parsoid temporarily (T408915)]] [14:08:27] T408915: visualdiff testing: Escaped link elements are shown in page view on fr.wikisource.org - https://phabricator.wikimedia.org/T408915 [14:08:35] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm7001.magru.wmnet with OS bookworm [14:09:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11516721 (10JMeybohm) [14:10:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516724 (10Jclark-ctr) [14:10:21] (03PS1) 10Santiago Faci: Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) [14:10:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mwlog1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:10:58] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11516728 (10ATitkov) > With some moderate additional effort I could reduce that "soft launch" time window to like 5 minutes. But please let m... [14:11:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11516729 (10JMeybohm) >>! In T413364#11514379, @KReid-WMF wrote: > Hi @Dzahn - the experimentation platform dashboards use private data, and as such I'll n... 
[14:11:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11516732 (10Jclark-ctr) This server is ready to be imaged pending @herron updating puppet [14:12:42] !log zabe@deploy2002 jgiannelos, zabe: Backport for [[gerrit:1225613|ProofreadPage: Disable flag to render using parsoid temporarily (T408915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:13:03] nemo-yiannis: can you test? [14:13:15] checking [14:13:16] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:13:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:13:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1344.eqiad.wmnet with OS trixie [14:13:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516735 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1344.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:14:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516736 (10Jclark-ctr) [14:14:11] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:16:00] ok zabe it works, thanks! [14:16:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1338.eqiad.wmnet with OS trixie [14:16:03] o/ [14:16:10] !log zabe@deploy2002 jgiannelos, zabe: Continuing with sync [14:16:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516739 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1338.eqiad.wmnet with OS trixie [14:16:20] thanks zabe :) [14:16:48] yw:) [14:16:55] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:17:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:17:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1345.eqiad.wmnet with OS trixie [14:17:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1345.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:18:05] (03CR) 10Ssingh: "hieradata/role/common/hcaptcha/proxy.yaml and I think hieradata/role/common/insetup_noferm.yaml can also be updated." [puppet] - 10https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:18:10] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1349.eqiad.wmnet with OS trixie [14:18:14] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1350.eqiad.wmnet with OS trixie [14:18:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1349.eqiad.wmnet with OS trixie [14:18:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516742 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1350.eqiad.wmnet with OS trixie [14:18:33] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1351.eqiad.wmnet with OS trixie [14:18:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516746 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1351.eqiad.wmnet with OS trixie [14:20:11] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:21:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1347.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:21:31] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1348.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [14:21:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:21:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1346.eqiad.wmnet with OS trixie [14:21:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516753 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1346.eqiad.wmnet with OS trixie completed: - wikikube... [14:22:01] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1352.eqiad.wmnet with OS trixie [14:22:07] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225613|ProofreadPage: Disable flag to render using parsoid temporarily (T408915)]] (duration: 13m 44s) [14:22:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516754 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS trixie [14:22:12] T408915: visualdiff testing: Escaped link elements are shown in page view on fr.wikisource.org - https://phabricator.wikimedia.org/T408915 [14:24:10] (03CR) 10Zabe: [C:03+2] Close kywikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225034 (https://phabricator.wikimedia.org/T413845) (owner: 10Zabe) [14:25:07] (03Merged) 10jenkins-bot: Close kywikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225034 (https://phabricator.wikimedia.org/T413845) (owner: 10Zabe) [14:25:49] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225034|Close kywikibooks (T413845)]] [14:25:53] T413845: Close kywikibooks - 
https://phabricator.wikimedia.org/T413845 [14:27:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1338.eqiad.wmnet with reason: host reimage [14:27:45] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11516796 (10JMeybohm) [14:28:02] !log zabe@deploy2002 zabe: Backport for [[gerrit:1225034|Close kywikibooks (T413845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:28:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1347.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:28:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1348.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:29:08] !log zabe@deploy2002 zabe: Continuing with sync [14:29:15] 06SRE, 10TimedMediaHandler-Transcode, 10ServiceOps new: Increase capacity for Mercurius webvideoTranscode job (1080p) processing - https://phabricator.wikimedia.org/T414427#11516808 (10JMeybohm) p:05Triage→03Medium [14:29:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1350.eqiad.wmnet with reason: host reimage [14:29:29] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1348.eqiad.wmnet with OS trixie [14:29:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1347.eqiad.wmnet with OS trixie [14:29:33] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1349.eqiad.wmnet with reason: host reimage [14:29:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1348.eqiad.wmnet with 
OS trixie [14:29:40] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1351.eqiad.wmnet with reason: host reimage [14:29:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1347.eqiad.wmnet with OS trixie [14:33:01] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [14:33:05] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225034|Close kywikibooks (T413845)]] (duration: 07m 16s) [14:33:09] T413845: Close kywikibooks - https://phabricator.wikimedia.org/T413845 [14:34:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1338.eqiad.wmnet with reason: host reimage [14:38:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2196: Schema change [14:38:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1350.eqiad.wmnet with reason: host reimage [14:38:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2196: Schema change [14:39:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2196: Schema change [14:39:24] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and not P{cp7016.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [14:39:24] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [14:39:28] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [14:40:26] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker1348.eqiad.wmnet with reason: host reimage [14:40:27] (03PS1) 10Btullis: Update the image used for the spark-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226261 (https://phabricator.wikimedia.org/T410017) [14:40:28] (03CR) 10Marostegui: sre.mysql.newpool: [de]pool various section kinds (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [14:40:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [14:41:19] (03PS1) 10JMeybohm: admin/data: Add shell and analytics-privatedata-users access for trueg [puppet] - 10https://gerrit.wikimedia.org/r/1226262 (https://phabricator.wikimedia.org/T414192) [14:41:53] (03CR) 10JMeybohm: [C:04-2] "Out of band verification of the SSH key is outstanding" [puppet] - 10https://gerrit.wikimedia.org/r/1226262 (https://phabricator.wikimedia.org/T414192) (owner: 10JMeybohm) [14:41:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1349.eqiad.wmnet with reason: host reimage [14:42:28] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11516901 (10kostajh) > 1) passing the relevant headers through to MediaWiki Who from #SRE co... 
[14:42:40] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and not P{cp7008.*} and A:cp - haproxy 2.8.18 upgrade (T414318) [14:43:46] (03CR) 10Ayounsi: [C:03+2] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:43:47] (03PS2) 10Btullis: Update the image used for the spark-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226261 (https://phabricator.wikimedia.org/T410017) [14:44:26] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:44:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:45:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [14:45:53] !log jclark@cumin1003 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [14:48:03] (03Merged) 10jenkins-bot: Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:48:05] (03PS1) 10Zabe: Stop setting import source for crwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226266 (https://phabricator.wikimedia.org/T411501) [14:48:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11516939 (10JMeybohm) The kerberos principal has been created. For off band verification of the SSH key, please confirm the key by putting it onto your ([[ https://www.me... 
[14:48:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516944 (10Jclark-ctr) [14:49:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1351.eqiad.wmnet with reason: host reimage [14:49:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516945 (10Jclark-ctr) [14:50:11] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:50:22] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:50:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:52:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1348.eqiad.wmnet with reason: host reimage [14:52:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:52:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1338.eqiad.wmnet with OS trixie [14:52:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1338.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:53:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516969 (10Jclark-ctr) [14:54:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:55:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:55:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1350.eqiad.wmnet with OS trixie [14:55:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1350.eqiad.wmnet with OS trixie completed: - wikikube... 
[14:55:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11516973 (10Jclark-ctr) [14:55:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [14:55:55] (03CR) 10Brouberol: [C:03+1] Update the image used for the spark-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226261 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:57:34] (03PS1) 10Brouberol: druid_exporter: duplicate config from druid 0.19.0 [puppet] - 10https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) [14:58:49] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:59:52] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:59:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1349.eqiad.wmnet with OS trixie [15:00:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1349.eqiad.wmnet with OS trixie completed: - wikikube... 
[15:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1500) [15:00:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517011 (10Jclark-ctr) [15:02:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11517016 (10Jclark-ctr) a:03Jclark-ctr [15:02:41] (03CR) 10JMeybohm: [C:03+1] "Key has now been verified." [puppet] - 10https://gerrit.wikimedia.org/r/1226262 (https://phabricator.wikimedia.org/T414192) (owner: 10JMeybohm) [15:03:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm7001.magru.wmnet with OS bookworm [15:04:52] (03PS1) 10Eevans: remove obsolete (pre-cfssl) sessionstore certificate [puppet] - 10https://gerrit.wikimedia.org/r/1226272 [15:06:20] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:06:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:06:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1351.eqiad.wmnet with OS trixie [15:06:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517037 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1351.eqiad.wmnet with OS trixie completed: - wikikube... 
[15:07:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1226262 (https://phabricator.wikimedia.org/T414192) (owner: 10JMeybohm) [15:07:31] (03PS1) 10Ejegg: Revert "Shorten 'close' cookie wait period for enwiki banners" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226275 (https://phabricator.wikimedia.org/T411800) [15:08:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:46] (03CR) 10Ayounsi: [C:03+1] dnsmasq: Require package to be installed [puppet] - 10https://gerrit.wikimedia.org/r/1226254 (owner: 10Muehlenhoff) [15:10:06] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - haproxy 2.8.18 upgrade (T414318) [15:10:10] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [15:10:17] !log vgutierrez@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - haproxy 2.8.18 upgrade (T414318) [15:10:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:10:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1348.eqiad.wmnet with OS trixie [15:10:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by jclark@cumin1003 for host wikikube-worker1348.eqiad.wmnet with OS trixie completed: - wikikube... [15:11:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1359.eqiad.wmnet with OS bookworm [15:11:26] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1358.eqiad.wmnet with OS bookworm [15:11:26] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1357.eqiad.wmnet with OS bookworm [15:11:29] !log revoked legacy restbase discovery certificate T365798 [15:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:34] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [15:11:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517049 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1359.eqiad.wmnet with OS bookworm [15:11:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1358.eqiad.wmnet with OS bookworm [15:11:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517051 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1357.eqiad.wmnet with OS bookworm [15:12:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517053 (10Jclark-ctr) [15:13:04] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:13:14] 
PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:13:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:13:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1352.eqiad.wmnet with OS trixie [15:13:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS trixie completed: - wikikube... [15:14:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11517056 (10cmooney) @Jclark-ctr I went to do this but it turns out we need to disconnect all the switch - switch links before the de... 
[15:15:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517059 (10Jclark-ctr) [15:15:32] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploying v1.1.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226276 (https://phabricator.wikimedia.org/T407808) [15:16:45] (03CR) 10Kareid: [C:03+1] Test Kitchen UI: Deploying v1.1.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226276 (https://phabricator.wikimedia.org/T407808) (owner: 10Santiago Faci) [15:16:49] !log Deploy schema change on x1 master to fix T414474 [15:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] T414474: growthexperiments_mentor_mentee.gemm_mentee_is_active is not present on some Wikipedias - https://phabricator.wikimedia.org/T414474 [15:17:02] (03CR) 10JMeybohm: [C:03+2] admin/data: Add shell and analytics-privatedata-users access for trueg [puppet] - 10https://gerrit.wikimedia.org/r/1226262 (https://phabricator.wikimedia.org/T414192) (owner: 10JMeybohm) [15:17:49] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1354.eqiad.wmnet with OS bookworm [15:18:03] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1355.eqiad.wmnet with OS bookworm [15:18:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1354.eqiad.wmnet with OS bookworm [15:18:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517084 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1355.eqiad.wmnet with OS bookworm [15:18:19] (03CR) 
10Santiago Faci: [C:03+2] Test Kitchen UI: Deploying v1.1.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226276 (https://phabricator.wikimedia.org/T407808) (owner: 10Santiago Faci) [15:18:19] (03PS6) 10Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) [15:18:37] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11517099 (10JMeybohm) [15:18:57] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1356.eqiad.wmnet with OS bookworm [15:19:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517100 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1356.eqiad.wmnet with OS bookworm [15:19:11] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11517101 (10ssingh) >>! In T392851#11514052, @Jhancock.wm wrote: > @ssingh do you need assistance getting these reimaged? Thanks for the offer, @Jhancock.w... [15:19:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11517106 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Key has been verified and patch merged. You should have access after ~30min max. 
[15:19:53] !log upgrade durum* to Bird 2.18 T413740 [15:19:54] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.1.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226276 (https://phabricator.wikimedia.org/T407808) (owner: 10Santiago Faci) [15:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:57] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [15:20:08] (03CR) 10C. Scott Ananian: [C:04-1] Turn on debugging for unsafe postproc cache entries logging (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [15:20:18] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:20:37] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:21:17] (03PS2) 10Clément Goubert: api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) [15:22:00] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1359.eqiad.wmnet with reason: host reimage [15:22:14] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:22:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1357.eqiad.wmnet with reason: host reimage [15:22:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1358.eqiad.wmnet with reason: host reimage [15:22:42] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:24:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2196: Schema change [15:25:59] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517177 (10ssingh) Yes, thanks for 
the ping @Paladox. We should most certainly pick this up again. @BBlack: any fresh 2026 thoughts? You listed some concerns above but some of them don't apply anymore -- should we do... [15:26:44] (03CR) 10Btullis: [C:03+1] druid_exporter: duplicate config from druid 0.19.0 [puppet] - 10https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [15:27:37] (03CR) 10Btullis: [C:03+2] Update the image used for the spark-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226261 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [15:27:49] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11517188 (10ssingh) >>! In T412396#11516901, @kostajh wrote: >> 1) passing the relevant heade... [15:28:18] (03CR) 10Clément Goubert: api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:28:31] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1354.eqiad.wmnet with reason: host reimage [15:28:42] (03CR) 10Clément Goubert: [V:03+2] api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:28:43] (03CR) 10Clément Goubert: [V:03+2 C:03+2] api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:29:32] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1356.eqiad.wmnet with reason: host reimage [15:29:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1359.eqiad.wmnet with 
reason: host reimage [15:29:41] (03Merged) 10jenkins-bot: Update the image used for the spark-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226261 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [15:29:55] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517195 (10ssingh) I should mention that `ns[01]` v6 will be unicast, like v4, and `ns2` will be anycast v6, just like the v4 one. But these are minor operational details, the real question is if we are ready to do th... [15:29:58] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1347.eqiad.wmnet with OS bookworm [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1530) [15:30:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517198 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1347.eqiad.wmnet with OS bookworm [15:30:52] (03Merged) 10jenkins-bot: api-gateway: Update ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226209 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:32:15] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploying v1.1.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226281 (https://phabricator.wikimedia.org/T407808) [15:32:18] (03PS1) 10Bking: WIP: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) [15:32:38] (03PS2) 10Santiago Faci: Test Kitchen UI: Deploying v1.1.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226281 (https://phabricator.wikimedia.org/T407808) [15:33:31] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploying v1.1.5 release to 
production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226281 (https://phabricator.wikimedia.org/T407808) (owner: 10Santiago Faci) [15:33:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1354.eqiad.wmnet with reason: host reimage [15:33:46] (03CR) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [15:33:55] (03CR) 10CI reject: [V:04-1] WIP: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [15:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:17] (03PS2) 10Clément Goubert: ratelimit: Update ratelimit service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226208 (https://phabricator.wikimedia.org/T414002) [15:34:17] (03PS1) 10Clément Goubert: api-gateway: Bump staging ratelimit version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226283 (https://phabricator.wikimedia.org/T414002) [15:34:38] (03CR) 10Muehlenhoff: [C:03+2] dnsmasq: Require package to be installed [puppet] - 10https://gerrit.wikimedia.org/r/1226254 (owner: 10Muehlenhoff) [15:34:42] (03PS2) 10Clément Goubert: api-gateway: Bump staging ratelimit version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226283 (https://phabricator.wikimedia.org/T414002) [15:34:47] (03PS4) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 
(https://phabricator.wikimedia.org/T412803) [15:35:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [15:35:44] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.1.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226281 (https://phabricator.wikimedia.org/T407808) (owner: 10Santiago Faci) [15:36:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [15:36:34] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [15:36:45] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [15:37:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1357.eqiad.wmnet with reason: host reimage [15:37:07] (03CR) 10Muehlenhoff: [C:03+2] ganeti/esams: Switch to dnsmasq as the DHCP relay [puppet] - 10https://gerrit.wikimedia.org/r/1226240 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [15:41:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1356.eqiad.wmnet with reason: host reimage [15:41:22] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [15:41:57] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Bump staging ratelimit version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226283 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:42:22] (03CR) 10Btullis: [C:03+2] Only install the JRE instead of the JDK on druid-test hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1226251 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [15:43:46] (03Merged) 10jenkins-bot: api-gateway: Bump staging ratelimit version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226283 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [15:46:21] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11517274 (10Xqt) @Fabfur: I can’t reproduce this issue locally, but it still occurs in the Pywikibot tests, though less frequently, see https://github.com/wikimedi... [15:46:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:46:44] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:46:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1359.eqiad.wmnet with OS bookworm [15:46:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1359.eqiad.wmnet with OS bookworm completed: - wikiku... 
[15:47:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517284 (10Jclark-ctr) [15:48:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [15:50:01] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11517311 (10elukey) Tested the removal of Docker images via HTTP DELETE API, followed by `sudo /usr/bin/docker-registry garbage-collect -m /etc/docker/registry/conf... [15:50:13] 06SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223#11517314 (10MLechvien-WMF) 05Open→03Resolved Confirming I'm now receiving the emails. Thanks! [15:50:42] (03PS1) 10Muehlenhoff: Unconditionally use dnsmasq on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) [15:51:04] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:51:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:51:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1354.eqiad.wmnet with OS bookworm [15:51:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1354.eqiad.wmnet with OS bookworm completed: - wikiku... 
[15:51:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517320 (10Jclark-ctr) [15:52:16] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517321 (10BBlack) We have to take this plunge someday, and that someday probably should've been years ago, just too many other pressing things to focus on for anyone to remember to come back here and look! A few no... [15:53:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1358.eqiad.wmnet with reason: host reimage [15:53:57] (03CR) 10Clare Ming: [C:03+1] Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [15:54:14] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:54:14] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - haproxy 2.8.18 upgrade (T414318) [15:54:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [15:54:19] T414318: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318 [15:55:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:55:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1357.eqiad.wmnet with OS bookworm [15:55:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517339 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1357.eqiad.wmnet with OS bookworm completed: - wikiku... [15:55:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517340 (10Jclark-ctr) [15:56:30] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:56:48] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:57:13] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - haproxy 2.8.18 upgrade (T414318) [15:57:35] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:57:47] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517352 (10BBlack) [In fact, on that point, I'd note a quick survey of a handful of other major sites on the Internet shows a common pattern of 2 days for the NS records and 2-4 days on the matching address records.... 
[15:57:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:57:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1356.eqiad.wmnet with OS bookworm [15:58:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1356.eqiad.wmnet with OS bookworm completed: - wikiku... [15:58:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517362 (10Jclark-ctr) [15:59:11] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:00:04] jelto, arnoldokoth, mutante, and arnaudb: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1600). 
[16:00:20] (03PS2) 10Muehlenhoff: Unconditionally use dnsmasq on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) [16:00:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:00:38] (03PS1) 10Hnowlan: thumbor: reimplement SVG max size feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226286 (https://phabricator.wikimedia.org/T411076) [16:02:30] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [16:02:59] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [16:03:02] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11517391 (10RobH) cp5022 is unresponsive to ping on its primary interface (expected with OS down) and idrac/mgmt interface (unexpected). 1-255962774671 entered should be completed by 2026-01-15 @ 13:... [16:03:15] (03PS2) 10Milimetric: trafficserver: Send /evt-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) [16:03:27] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T414479 [16:03:29] (03CR) 10Clément Goubert: [C:03+2] ratelimit: Update ratelimit service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226208 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [16:03:31] T414479: Deploy Phab/Phorge 2026-01-13 - https://phabricator.wikimedia.org/T414479 [16:03:54] !log brennen@deploy2002 Started deploy [phabricator/deployment@f12e2e1]: deploy phab2002 for T414479 [16:04:26] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f12e2e1]: deploy phab2002 for T414479 (duration: 00m 31s) [16:04:54] !log brennen@deploy2002 Started deploy [phabricator/deployment@f12e2e1]: deploy phab1004 for 
T414479 [16:05:30] (03Merged) 10jenkins-bot: ratelimit: Update ratelimit service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226208 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [16:05:30] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [16:05:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [16:05:54] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1347.eqiad.wmnet with OS bookworm [16:06:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517411 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1347.eqiad.wmnet with OS bookworm completed: - wikiku... [16:06:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1347.eqiad.wmnet with OS bookworm executed with error... 
[16:06:29] (03PS3) 10Muehlenhoff: Unconditionally use dnsmasq on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) [16:06:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517417 (10Jclark-ctr) [16:07:03] !log brennen@deploy2002 Finished deploy [phabricator/deployment@f12e2e1]: deploy phab1004 for T414479 (duration: 02m 09s) [16:07:31] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/ratelimit: apply [16:08:46] (03CR) 10Vgutierrez: "`profile::liberica::include_services` needs to be updated for lvs7001 and lvs7003" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [16:09:07] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [16:09:58] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [16:10:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [16:10:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/ratelimit: apply [16:10:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517473 (10Jclark-ctr) wikikube-worker1353 is complaining of cable disconnected wikikube-worker1355 is failing to image will need to investigate [16:10:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [16:10:44] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [16:11:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [16:11:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1358.eqiad.wmnet with OS bookworm [16:11:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1358.eqiad.wmnet with OS bookworm completed: - wikiku... [16:11:33] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [16:11:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517499 (10Jclark-ctr) [16:11:57] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11517507 (10MoritzMuehlenhoff) [16:12:47] (03CR) 10Vgutierrez: "oh.. it's already there, forget it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [16:13:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182693 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [16:13:49] (03CR) 10Ayounsi: [C:03+1] Unconditionally use dnsmasq on routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1226285 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [16:13:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:14:51] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: changed REST sandbox rerouting to redirection [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1224838 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [16:15:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:15:58] (03PS1) 10Clément Goubert: Revert^4 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1226288 (https://phabricator.wikimedia.org/T396807) [16:16:59] (03Merged) 10jenkins-bot: rest-gateway: changed REST sandbox rerouting to redirection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224838 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [16:18:10] (03CR) 10Muehlenhoff: [C:03+2] conf/codfw: Remove now obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1182693 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [16:20:18] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:20:32] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:20:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:21:13] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:25:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:26:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:27:15] !log cgoubert@deploy2002 helmfile [eqiad] START 
helmfile.d/services/rest-gateway: apply [16:27:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:27:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [16:28:04] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [16:28:41] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11517677 (10elukey) Little refresher about what we store on Redis in T375645#10217676. I checked in the eqiad redis cache (where I pushed the images) and I haven't... [16:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:29:27] (03PS1) 10Clément Goubert: rest-gateway: Fix RedirectResponseCode value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226291 (https://phabricator.wikimedia.org/T396807) [16:30:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:33:57] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix RedirectResponseCode value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226291 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [16:34:53] 06SRE, 10LDAP-Access-Requests: Grant Access to WMF(?) 
for HFanWMF - https://phabricator.wikimedia.org/T414492 (10HFan-WMF) 03NEW [16:35:46] (03Merged) 10jenkins-bot: rest-gateway: Fix RedirectResponseCode value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226291 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [16:37:23] 06SRE, 10SRE-Access-Requests: Grant Access to WMF(?) for HFanWMF - https://phabricator.wikimedia.org/T414492#11517764 (10Novem_Linguae) Sounds like you need analytics-privatedata-users level 1 [16:38:14] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1355.eqiad.wmnet with OS bookworm [16:38:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11517769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1355.eqiad.wmnet with OS bookworm executed with error... [16:38:59] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:23] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:41:40] !log roll restart docker-registry-swift daemons on registry* to pick up the new settings (apparently the service refresh issued by puppet didn't work as intended) - T390251 [16:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:44] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [16:47:24] (03CR) 10Brouberol: [C:03+2] druid_exporter: duplicate config from druid 0.19.0 [puppet] - 10https://gerrit.wikimedia.org/r/1226270 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [16:47:33] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:17] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 90%, RTA = 0.38 ms [16:48:29] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:48:31] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:48:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:11] FIRING: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:33] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:43] 10ops-eqiad, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1338:9290 - https://phabricator.wikimedia.org/T414496 (10phaultfinder) 03NEW [16:50:07] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:29] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:51:31] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:52:45] <_joe_> fabfur: are you looking into it? [16:52:57] <_joe_> the usual query of death? 
sorry I was in a call [16:52:59] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:53:05] <_joe_> ah recovery I see [16:53:21] yeah, might not be permanent though - we're looking at it [16:53:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:11] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:11] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:54:34] Several things on fire right now. 
[16:55:07] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:07] RESOLVED: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:56] (03PS4) 10Jsn.sherman: InitialiseSettings.php: Add wmgUsePersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) [16:57:03] (03PS5) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) [16:57:08] (03PS5) 10Jsn.sherman: CommonSettings-labs: Load PersonalDashbard extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) [16:57:44] (03PS2) 10Ahmon Dancy: git::clone: Get default branch name a different way [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) [16:58:07] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [16:58:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [16:59:20] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [17:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1700) [17:00:05] dancy and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:17] o/ [17:00:31] o/ [17:00:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217787 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [17:01:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217788 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [17:01:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217789 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [17:01:37] (03CR) 10C. Scott Ananian: [C:03+1] Turn on debugging for unsafe postproc cache entries logging (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [17:01:44] (03CR) 10JHathaway: [C:03+2] deployment-prep common.yaml: Update mediawiki_smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/1225620 (https://phabricator.wikimedia.org/T412975) (owner: 10Ahmon Dancy) [17:01:52] (03CR) 10JHathaway: [C:03+2] git::clone: Get default branch name a different way [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [17:03:28] dancy: merged [17:03:34] Thanks! 
[17:03:41] thank you! [17:04:22] (03PS1) 10Majavah: python: Log to stdout/stderr only [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1226300 (https://phabricator.wikimedia.org/T401102) [17:04:24] (03PS1) 10Majavah: lighttpd: Log to stdout/stderr only [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1226301 (https://phabricator.wikimedia.org/T401102) [17:05:41] 10ops-eqsin, 06SRE: Inbound errors on interface cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://phabricator.wikimedia.org/T405938#11517987 (10RobH) 05Open→03Resolved a:03RobH [17:05:52] 10ops-eqsin, 06SRE: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390147#11517989 (10RobH) 05Open→03Resolved a:03RobH [17:06:52] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: git::clone can fail to checkout its remote branch, leading to unrecoverable failure - https://phabricator.wikimedia.org/T413193#11517996 (10dancy) 05Open→03Resolved [17:06:57] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11517997 (10RobH) > IBX Question:Dear Customer,We have traced both power cables, they are both connected at port 30 of PS1 and PS2.We have also unplugged and plug back both power cables as instructed.... [17:13:18] (03CR) 10Ahmon Dancy: "Possibly needed for https://phabricator.wikimedia.org/T414504 too" [puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [17:14:18] (03CR) 10Scott French: [C:03+1] "🎉" [cookbooks] - 10https://gerrit.wikimedia.org/r/1226211 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [17:16:47] (03CR) 10Majavah: "As Andrew says above, a PCC run on all of eqiad1 should be the first step in figuring out what kind of impact this would have." 
[puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [17:18:28] (03CR) 10Jbond: "This will need review and someone from sre and wmcs to review and shepherd it." [puppet] - 10https://gerrit.wikimedia.org/r/702326 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [17:20:13] (03PS1) 10Btullis: Remove GC logging options that are incpatible with Java 17 on druid-test [puppet] - 10https://gerrit.wikimedia.org/r/1226302 (https://phabricator.wikimedia.org/T278056) [17:20:57] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7873/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226302 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [17:21:17] (03PS2) 10Btullis: Remove GC logging options that are incompatible with Java 17 on druid-test [puppet] - 10https://gerrit.wikimedia.org/r/1226302 (https://phabricator.wikimedia.org/T278056) [17:22:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7874/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226302 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [17:23:03] (03CR) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [17:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:48] (03CR) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe 
postproc cache entries logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) (owner: 10Isabelle Hurbain-Palatin) [17:26:14] (03PS5) 10Isabelle Hurbain-Palatin: Turn on debugging for unsafe postproc cache entries logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226232 (https://phabricator.wikimedia.org/T412803) [17:31:42] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11518193 (10elukey) Tried to push another image, and I got stuck two times in this very late state: ` elukey@build2001:~$ sudo docker push registry1004.eqiad.wmnet... [17:55:56] (03PS1) 10Hnowlan: thanos: set performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1226311 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1800) [18:00:13] (03CR) 10Herron: [C:03+1] thanos: set performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1226311 (owner: 10Hnowlan) [18:02:02] (03PS2) 10Bking: WIP: Alert DPE SRE when probes fail in dse-k8s clusters [alerts] - 10https://gerrit.wikimedia.org/r/1226282 (https://phabricator.wikimedia.org/T412447) [18:04:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11518459 (10calbon) I approve, I would like Katherine to have shell access [18:05:35] (03PS7) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [18:06:29] (03CR) 10Btullis: [V:03+1 C:03+2] Remove GC logging options that are incompatible with Java 17 on druid-test [puppet] - 10https://gerrit.wikimedia.org/r/1226302 (https://phabricator.wikimedia.org/T278056) (owner: 10Btullis) [18:06:49] (03PS8) 
10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [18:07:41] 06SRE, 10Scap, 06serviceops, 07Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11518462 (10dancy) > Potential ideas: > [] Drop a lock file on the deployment server that scap detects, remove it at a later step > [] Add a switch... [18:11:33] (03PS9) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [18:19:12] (03PS10) 10Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [18:24:39] (03CR) 10Vgutierrez: "PS10 fixes VCL syntax for hosts where flags are disabled" [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [18:26:56] (03PS1) 10Clare Ming: Add Test Kitchen maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) [18:33:17] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11518537 (10ssingh) >>! In T81605#11517321, @BBlack wrote: > We have to take this plunge someday, and that someday probably should've been years ago, just too many other pressing things to focus on for anyone to rememb... [18:36:32] (03PS2) 10Clare Ming: Add Test Kitchen maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1226318 (https://phabricator.wikimedia.org/T407806) [18:36:48] 06SRE, 06Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11518553 (10ssingh) Our glue records also have a disparity. 
` dig wikimedia.org NS +trace +additional ns2.wikimedia.org. 3600 IN A 198.35.27.27 ns1.wikimedia.org. 3600 IN A 208.80.153.231 ns0.wikimedia.org. 3600 IN A... [18:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:39:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:44:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:45:04] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1338:9290 - https://phabricator.wikimedia.org/T414496#11518582 (10Jclark-ctr) a:03Jclark-ctr [18:46:36] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1338:9290 - https://phabricator.wikimedia.org/T414496#11518589 (10Jclark-ctr) still working on setting up these servers T408752 [18:49:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:52:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:00:05] jeena and dduvall: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T1900). [19:09:36] I will start the train shortly [19:13:17] (03CR) 10Jeena Huneidi: [C:03+2] extension-list: add a bogus extension to test l10n-update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516) (owner: 10BryanDavis) [19:14:05] (03Merged) 10jenkins-bot: extension-list: add a bogus extension to test l10n-update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516) (owner: 10BryanDavis) [19:15:39] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226326 (https://phabricator.wikimedia.org/T413802) [19:15:42] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226326 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:16:41] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226326 (https://phabricator.wikimedia.org/T413802) (owner: 10TrainBranchBot) [19:21:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1226272 (owner: 10Eevans) [19:22:48] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.11 refs T413802 [19:22:52] T413802: 1.46.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T413802 [19:25:38] (03CR) 10Eevans: [C:03+2] remove obsolete (pre-cfssl) sessionstore certificate [puppet] - 10https://gerrit.wikimedia.org/r/1226272 (owner: 10Eevans) [19:40:49] !log jhuneidi@deploy2002 Started scap sync-world: test sync for T411516 [19:40:53] T411516: Add ability to ignore missing extensions in 
mergeMessageFileList's `--list-file` input - https://phabricator.wikimedia.org/T411516 [19:42:35] (03PS1) 10Pppery: Urwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226329 (https://phabricator.wikimedia.org/T413592) [19:43:27] !log jhuneidi@deploy2002 Finished scap sync-world: test sync for T411516 (duration: 02m 38s) [19:47:35] (03PS1) 10Pppery: Siwiki: Add MOS namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226331 (https://phabricator.wikimedia.org/T414159) [19:50:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11518819 (10Dwisehaupt) 05Open→03Resolved a:03Dwisehaupt Host is built out and ready to be put into service. [19:51:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226331 (https://phabricator.wikimedia.org/T414159) (owner: 10Pppery) [19:51:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226329 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [20:16:04] 06SRE, 13Patch-For-Review, 10ServiceOps new: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11518931 (10Scott_French) [20:16:35] 06SRE, 07Kubernetes, 13Patch-For-Review, 10ServiceOps new: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11518934 (10Scott_French) [20:27:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11518969 (10cmooney) So to put more shape on this the two new racks for Fundraising are as follows: ######Racks * `E15` ** Replacement for 
rack C1, the switches from rack C1 an... [20:27:47] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 10%, RTA = 4798.19 ms [20:27:59] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:29:19] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker1338:9290 - https://phabricator.wikimedia.org/T414496#11518970 (10Jclark-ctr) 05Open→03Resolved [20:30:06] 06SRE, 06Release-Engineering-Team, 10Scap, 06serviceops, and 2 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11518971 (10dancy) [20:44:01] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [20:47:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1353.eqiad.wmnet with OS bookworm [20:47:26] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1355.eqiad.wmnet with OS bookworm [20:47:30] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt WIKIKUBE - jclark@cumin1003" [20:47:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519016 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm [20:47:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519017 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1355.eqiad.wmnet with OS bookworm [20:47:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt WIKIKUBE - jclark@cumin1003" [20:47:46] !log jclark@cumin1003 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:48:15] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1352 [20:48:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1352 [20:48:28] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1353 [20:48:40] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1352.eqiad.wmnet with OS bookworm [20:48:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1353 [20:48:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm [20:48:54] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1355 [20:49:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1355 [20:53:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87480 and previous config saved to /var/cache/conftool/dbconfig/20260113-205308-marostegui.json [20:53:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:53:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:58:03] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1355.eqiad.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC 
late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T2100). [21:00:05] danisztls, AaronSchulz, cjming, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:29] Here [21:00:41] mine can go out alongside others [21:01:32] o/ [21:01:41] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:01:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm executed with error... [21:02:01] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:02:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm executed with error... [21:02:17] AaronSchulz: i can do your patch with mine [21:02:24] danisztls: are you here? 
[21:02:27] thanks [21:02:47] Pppery: i can do yours if needed too [21:03:01] ok [21:03:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P87481 and previous config saved to /var/cache/conftool/dbconfig/20260113-210317-marostegui.json [21:03:48] since it looks like danisztls isn't here yet, i'll proceed with Aaron's and my config patches [21:04:06] (03PS2) 10Aaron Schulz: Update description of the Math API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) [21:04:14] (03PS2) 10Santiago Faci: Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) [21:04:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1355.eqiad.wmnet with reason: host reimage [21:05:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:05:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [21:06:19] (03Merged) 10jenkins-bot: Update description of the Math API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:06:23] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11519050 (10VRiley-WMF) Received part. 
Can we get a time to replace this drive @BTullis for an-worker1200 [21:06:24] (03Merged) 10jenkins-bot: Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226257 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [21:06:56] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1223261|Update description of the Math API (T411517)]], [[gerrit:1226257|Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org (T407805)]] [21:07:01] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:07:02] T407805: Rename mpic.wikimedia.org - https://phabricator.wikimedia.org/T407805 [21:08:12] Sorry, I'm late. [21:09:05] !log cjming@deploy2002 aaron, cjming, sfaci: Backport for [[gerrit:1223261|Update description of the Math API (T411517)]], [[gerrit:1226257|Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org (T407805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:35] danisztls: no worries - you can next - do you need a deployer? [21:09:44] *you can go next [21:10:13] AaronSchulz: good to sync? 
[21:11:42] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:11:46] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:11:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519065 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:11:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519066 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:12:01] cjming: sure [21:12:09] !log cjming@deploy2002 aaron, cjming, sfaci: Continuing with sync [21:12:51] cjming: I can deploy [21:13:00] (03PS1) 10Zabe: Start writing to il_target_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226350 (https://phabricator.wikimedia.org/T413526) [21:13:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P87482 and previous config saved to /var/cache/conftool/dbconfig/20260113-211325-marostegui.json [21:16:14] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223261|Update description of the Math API (T411517)]], [[gerrit:1226257|Replaced mpic-next.wikimedia.org with test-kitchen-next.wikimedia.org (T407805)]] (duration: 09m 18s) [21:16:19] AaronSchulz: should be live :) [21:16:19] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:16:20] T407805: Rename mpic.wikimedia.org - https://phabricator.wikimedia.org/T407805 [21:16:23] danisztls: all yours [21:16:32] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 (https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [21:16:35] danisztls: lmk when you're done and i can do the rest in the queue [21:16:41] cjming: ok! [21:17:26] (03Merged) 10jenkins-bot: Undeploy Safety survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225664 (https://phabricator.wikimedia.org/T413022) (owner: 10DDesouza) [21:17:56] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1225664|Undeploy Safety survey (T413022)]] [21:18:00] T413022: First test, then launch the 2026 Community Safety survey - https://phabricator.wikimedia.org/T413022 [21:19:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11519101 (10VRiley-WMF) [21:20:08] !log dani@deploy2002 dani: Backport for [[gerrit:1225664|Undeploy Safety survey (T413022)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:20:31] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:20:57] !log dani@deploy2002 dani: Continuing with sync [21:21:00] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:21:07] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:21:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519103 (10Jclark-ctr) [21:21:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm executed with error... [21:21:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519105 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm executed with error... 
[21:21:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:21:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1355.eqiad.wmnet with OS bookworm [21:21:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1355.eqiad.wmnet with OS bookworm completed: - wikiku... [21:23:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87483 and previous config saved to /var/cache/conftool/dbconfig/20260113-212333-marostegui.json [21:23:41] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:23:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:23:51] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:23:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2247.codfw.wmnet with reason: Maintenance [21:24:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm [21:24:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87484 and previous config saved to /var/cache/conftool/dbconfig/20260113-212400-marostegui.json 
[21:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:24:55] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225664|Undeploy Safety survey (T413022)]] (duration: 06m 59s) [21:24:59] T413022: First test, then launch the 2026 Community Safety survey - https://phabricator.wikimedia.org/T413022 [21:25:05] cjming: done. thanks! [21:25:13] great - thx [21:25:35] Pppery: onto your patches [21:25:39] Ok [21:25:41] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:25:48] (03PS2) 10Pppery: Siwiki: Add MOS namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226331 (https://phabricator.wikimedia.org/T414159) [21:25:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:25:54] You'll need to run namespaceDupes on siwiki [21:25:59] yup [21:26:08] is there a script that needs to be run for your 2nd patch? 
[21:26:18] Yes, looking [21:26:35] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [21:26:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226331 (https://phabricator.wikimedia.org/T414159) (owner: 10Pppery) [21:27:23] (03Merged) 10jenkins-bot: Siwiki: Add MOS namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226331 (https://phabricator.wikimedia.org/T414159) (owner: 10Pppery) [21:27:55] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1226331|Siwiki: Add MOS namespace (T414159)]] [21:27:58] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [21:28:00] T414159: MOS prefix pages in siwiki - https://phabricator.wikimedia.org/T414159 [21:29:06] You'll need to purge both the logo URL (https://en.wikipedia.org/static/images/project-logos/urwikiquote.png) and the wordmark URL (https://en.wikipedia.org/static/images/mobile/copyright/wikiquote-wordmark-ur.svg) [21:29:20] sounds good - will do [21:30:06] !log cjming@deploy2002 pppery, cjming: Backport for [[gerrit:1226331|Siwiki: Add MOS namespace (T414159)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:30:22] Pppery: want to check 1st patch? 
[21:30:27] lmk when to sync [21:30:33] checking now [21:31:01] Looks good, proceed [21:31:49] !log cjming@deploy2002 pppery, cjming: Continuing with sync [21:32:09] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:33:27] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:33:39] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [21:35:44] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226331|Siwiki: Add MOS namespace (T414159)]] (duration: 07m 49s) [21:35:47] T414159: MOS prefix pages in siwiki - https://phabricator.wikimedia.org/T414159 [21:36:31] (03PS2) 10Pppery: Urwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226329 (https://phabricator.wikimedia.org/T413592) [21:36:40] !log cjming@deploy2002 mwscript-k8s job started: namespaceDupes siwiki --fix # T414159 [21:37:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226329 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [21:37:34] Pppery: ran script for siwiki [21:37:40] OK [21:38:21] That didn't do what I expected it to do. 
I'll clean up on the wiki manually [21:38:37] (03Merged) 10jenkins-bot: Urwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226329 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [21:38:57] (And submit a patch to make namespaceDupes handle this edge case better) [21:39:08] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1226329|Urwikiquote: Update logo (T413592)]] [21:39:12] T413592: Urdu Wikiquote update wordmark - https://phabricator.wikimedia.org/T413592 [21:40:18] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:41:21] !log cjming@deploy2002 pppery, cjming: Backport for [[gerrit:1226329|Urwikiquote: Update logo (T413592)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:41:25] ok [21:41:40] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:42:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:42:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:42:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87485 and previous config saved to /var/cache/conftool/dbconfig/20260113-214244-marostegui.json [21:42:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:42:50] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:43:43] Well, I can confirm the new logo files deployed. It's possible that they are rendering wrong or backwards but if that's the case the wiki admins will point that out and upload a new file. In the meantime you can proceed [21:43:53] alrighty [21:43:57] !log cjming@deploy2002 pppery, cjming: Continuing with sync [21:44:17] vriley@cumin1003 provision (PID 1358319) is awaiting input [21:44:27] Gosh neither of my two patches today went quite according to plan (for reasons that are kind of beyond my control). I'm really not having a good day today [21:44:47] sorry to hear - if it's any consolation, that was me yesterday! 
[21:47:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:47:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:47:56] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226329|Urwikiquote: Update logo (T413592)]] (duration: 08m 47s) [21:47:59] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:48:00] T413592: Urdu Wikiquote update wordmark - https://phabricator.wikimedia.org/T413592 [21:48:08] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1371 [21:48:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1371 [21:48:42] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [21:49:13] Pppery: all done - ran purgeList for both files [21:49:18] OK [21:50:52] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:51:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [21:51:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1352.eqiad.wmnet with OS bookworm [21:51:27] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:51:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519281 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1352.eqiad.wmnet with OS bookworm completed: - wikiku... [21:52:19] !log end of UTC late backport window [21:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P87486 and previous config saved to /var/cache/conftool/dbconfig/20260113-215252-marostegui.json [21:52:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11519287 (10KReid-WMF) [21:54:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519288 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm executed with error... 
[21:55:22] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [21:55:58] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:57:17] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T2200) [22:00:47] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1365 - vriley@cumin1003" [22:00:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1365 - vriley@cumin1003" [22:00:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:27] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1365 [22:03:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P87487 and previous config saved to /var/cache/conftool/dbconfig/20260113-220300-marostegui.json [22:04:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1365 [22:05:17] jouncebot: nowandnext [22:05:17] For the next 0 hour(s) and 54 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260113T2200) [22:05:17] In 8 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260114T0700) [22:05:47] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1371.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [22:06:10] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:06:11] (03CR) 10Zabe: [C:03+2] Start writing to il_target_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226350 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [22:07:12] (03Merged) 10jenkins-bot: Start writing to il_target_id on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226350 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [22:07:46] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226350|Start writing to il_target_id on group0 wikis (T413526)]] [22:07:50] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [22:07:57] vriley@cumin1003 provision (PID 1363536) is awaiting input [22:08:14] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1365.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:09:47] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:09:51] !log zabe@deploy2002 zabe: Backport for [[gerrit:1226350|Start writing to il_target_id on group0 wikis (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:10:37] !log zabe@deploy2002 zabe: Continuing with sync [22:11:37] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1363.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:12:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1363.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:12:49] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1363.eqiad.wmnet with OS trixie [22:13:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519310 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie [22:13:08] (03PS1) 10Pppery: Urwikiquote: restore flipped icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226366 (https://phabricator.wikimedia.org/T413592) [22:13:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87488 and previous config saved to /var/cache/conftool/dbconfig/20260113-221309-marostegui.json [22:13:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:13:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:13:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1252.eqiad.wmnet with reason: Maintenance [22:13:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87489 and previous config saved to /var/cache/conftool/dbconfig/20260113-221333-marostegui.json [22:14:41] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226350|Start writing to il_target_id on 
group0 wikis (T413526)]] (duration: 06m 55s) [22:14:45] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [22:15:24] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1365.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1371.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:48] 06SRE, 06Release-Engineering-Team, 10Scap, 06serviceops, and 2 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11519322 (10dancy) [22:24:14] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1363.eqiad.wmnet with reason: host reimage [22:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 9.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:26:51] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1371.eqiad.wmnet with OS trixie [22:27:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1371.eqiad.wmnet with OS trixie [22:27:04] (03PS3) 10Jasmine: charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 [22:27:04] (03PS4) 10Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 [22:28:50] !log vriley@cumin1003 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1363.eqiad.wmnet with reason: host reimage [22:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 10.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:34:30] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.009e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [22:37:10] (03CR) 10Zabe: [C:03+2] Stop setting import source for crwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226266 (https://phabricator.wikimedia.org/T411501) (owner: 10Zabe) [22:37:50] (03PS1) 10Ryan Kemper: wdqs: allowlist DNB [puppet] - 10https://gerrit.wikimedia.org/r/1226367 (https://phabricator.wikimedia.org/T406721) [22:37:52] (03PS1) 10Ryan Kemper: wdqs: allowlist GTAA [puppet] - 10https://gerrit.wikimedia.org/r/1226368 (https://phabricator.wikimedia.org/T413226) [22:37:54] (03PS1) 10Ryan Kemper: wdqs: allowlist ODL (dutch media) [puppet] - 10https://gerrit.wikimedia.org/r/1226369 (https://phabricator.wikimedia.org/T412969) [22:38:17] (03Merged) 10jenkins-bot: Stop setting import source for crwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226266 (https://phabricator.wikimedia.org/T411501) (owner: 10Zabe) [22:38:22] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1371.eqiad.wmnet with reason: host reimage [22:38:51] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1226266|Stop setting import 
source for crwiki (T411501)]] [22:38:55] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501 [22:40:57] !log zabe@deploy2002 zabe: Backport for [[gerrit:1226266|Stop setting import source for crwiki (T411501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:41:18] !log zabe@deploy2002 zabe: Continuing with sync [22:43:52] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:44:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1371.eqiad.wmnet with reason: host reimage [22:44:55] (CR) Bking: [C:+1] wdqs: allowlist DNB [puppet] - https://gerrit.wikimedia.org/r/1226367 (https://phabricator.wikimedia.org/T406721) (owner: Ryan Kemper) [22:45:15] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226266|Stop setting import source for crwiki (T411501)]] (duration: 06m 24s) [22:45:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:45:18] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1363.eqiad.wmnet with OS trixie [22:45:19] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501 [22:45:21] (CR) Bking: [C:+1] wdqs: allowlist GTAA [puppet] - https://gerrit.wikimedia.org/r/1226368 (https://phabricator.wikimedia.org/T413226) (owner: Ryan Kemper) [22:45:42] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie
completed: - wikikub... [22:46:07] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1365.eqiad.wmnet with OS trixie [22:46:15] (CR) Bking: [C:+1] wdqs: allowlist ODL (dutch media) [puppet] - https://gerrit.wikimedia.org/r/1226369 (https://phabricator.wikimedia.org/T412969) (owner: Ryan Kemper) [22:46:27] (CR) Ryan Kemper: [C:+2] wdqs: allowlist ODL (dutch media) [puppet] - https://gerrit.wikimedia.org/r/1226369 (https://phabricator.wikimedia.org/T412969) (owner: Ryan Kemper) [22:46:28] (CR) Ryan Kemper: [C:+2] wdqs: allowlist GTAA [puppet] - https://gerrit.wikimedia.org/r/1226368 (https://phabricator.wikimedia.org/T413226) (owner: Ryan Kemper) [22:46:30] (CR) Ryan Kemper: [C:+2] wdqs: allowlist DNB [puppet] - https://gerrit.wikimedia.org/r/1226367 (https://phabricator.wikimedia.org/T406721) (owner: Ryan Kemper) [22:46:41] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519398 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1365.eqiad.wmnet with OS trixie [22:48:17] (PS1) JHathaway: firewall: add to role::wmcs::instance, remove includes [puppet] - https://gerrit.wikimedia.org/r/1226371 (https://phabricator.wikimedia.org/T411089) [22:49:12] (CR) JHathaway: firewall: Declare resources for both providers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah) [22:52:51] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1366.eqiad.wmnet with OS trixie [22:53:00] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519423 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1366.eqiad.wmnet
with OS trixie [22:55:19] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:57:10] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1365.eqiad.wmnet with reason: host reimage [23:01:20] (PS5) Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1224998 [23:01:23] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:01:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:01:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1371.eqiad.wmnet with OS trixie [23:02:08] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519427 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1371.eqiad.wmnet with OS trixie completed: - wikikub... [23:03:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1365.eqiad.wmnet with reason: host reimage [23:04:49] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [23:08:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [23:11:40] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519446 (VRiley-WMF) [23:14:49] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!!"
[puppet] - https://gerrit.wikimedia.org/r/1226311 (owner: Hnowlan) [23:19:56] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:20:51] (PS5) Zabe: manage-dblist: Improve generation of db-sections.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1225171 [23:21:18] (CR) Zabe: [C:+2] manage-dblist: Improve generation of db-sections.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1225171 (owner: Zabe) [23:22:06] (Merged) jenkins-bot: manage-dblist: Improve generation of db-sections.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1225171 (owner: Zabe) [23:22:42] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1225171|manage-dblist: Improve generation of db-sections.php]] [23:23:00] vriley@cumin1003 reimage (PID 1369697) is awaiting input [23:23:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:23:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1365.eqiad.wmnet with OS trixie [23:23:45] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519469 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1365.eqiad.wmnet with OS trixie completed: - wikikub... [23:24:48] !log zabe@deploy2002 zabe: Backport for [[gerrit:1225171|manage-dblist: Improve generation of db-sections.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:25:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1366.eqiad.wmnet with OS trixie [23:25:22] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519475 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1366.eqiad.wmnet with OS trixie completed: - wikikub... [23:27:48] !log zabe@deploy2002 zabe: Continuing with sync [23:28:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [23:30:54] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1353.eqiad.wmnet with OS bookworm [23:31:01] ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11519483 (RKemper) When rebooting this server (as part of routine maintenance), it got stuck unable to boot. After powercycling and looking at console com2, it was getti...
[23:31:28] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519486 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm [23:31:52] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225171|manage-dblist: Improve generation of db-sections.php]] (duration: 09m 10s) [23:31:59] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [23:32:09] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:32:17] PROBLEM - SSH on an-worker1148 is CRITICAL: connect to address 10.64.142.2 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:33:18] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1353.eqiad.wmnet with reason: host reimage [23:38:06] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:38:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1353.eqiad.wmnet with reason: host reimage [23:39:21] PROBLEM - Host dbprov1004 is DOWN: PING CRITICAL - Packet loss = 100% [23:39:34] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:39:37] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1372.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:39:54] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [23:40:49] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1372.eqiad.wmnet with OS trixie [23:41:03] ops-eqiad, SRE, DC-Ops,
serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11519507 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1372.eqiad.wmnet with OS trixie [23:42:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:42:53] (PS1) Zabe: Cleanup manage-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1226381 [23:43:17] RECOVERY - SSH on an-worker1148 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:43:41] (CR) CI reject: [V:-1] Cleanup manage-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1226381 (owner: Zabe) [23:44:26] (PS2) Zabe: Cleanup manage-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1226381 [23:45:27] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1370 [23:45:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1370 [23:48:23] (CR) Zabe: [C:+2] BETA: Start reading from file table [mediawiki-config] - https://gerrit.wikimedia.org/r/1216872 (https://phabricator.wikimedia.org/T412164) (owner: Zabe) [23:49:08] (Merged) jenkins-bot: BETA: Start reading from file table [mediawiki-config] - https://gerrit.wikimedia.org/r/1216872 (https://phabricator.wikimedia.org/T412164) (owner: Zabe) [23:49:15] RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:49:24] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:50:50] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1370.mgmt.eqiad.wmnet with chassis set policy
FORCE_RESTART [23:52:02] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1372.eqiad.wmnet with reason: host reimage [23:55:22] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [23:55:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [23:55:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1353.eqiad.wmnet with OS bookworm [23:55:53] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519514 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1353.eqiad.wmnet with OS bookworm completed: - wikiku... [23:56:39] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519515 (Jclark-ctr) [23:57:48] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11519527 (Jclark-ctr) Open→Resolved [23:58:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1372.eqiad.wmnet with reason: host reimage