[00:02:32] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1236859
[00:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 17.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:12:15] (CR) Reedy: [C:+2] Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) (owner: Zabe)
[00:12:17] (CR) Reedy: [C:+2] Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [vendor] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236693 (https://phabricator.wikimedia.org/T416456) (owner: Zabe)
[00:24:11] (Merged) jenkins-bot: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [vendor] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236693 (https://phabricator.wikimedia.org/T416456) (owner: Zabe)
[00:25:13] (Merged) jenkins-bot: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) (owner: Zabe)
[00:30:02] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1236692|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]], [[gerrit:1236693|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]]
[00:30:05] T416456: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty (/w/rest.php/oauth2/access_token) - https://phabricator.wikimedia.org/T416456
[00:30:35] thx Reedy.
[00:32:16] !log reedy@deploy2002 reedy, zabe: Backport for [[gerrit:1236692|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]], [[gerrit:1236693|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:32:35] !log reedy@deploy2002 reedy, zabe: Continuing with sync
[00:36:52] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236692|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]], [[gerrit:1236693|Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" (T416456)]] (duration: 06m 50s)
[00:36:55] T416456: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty (/w/rest.php/oauth2/access_token) - https://phabricator.wikimedia.org/T416456
[00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236861
[00:40:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236861 (owner: TrainBranchBot)
[00:41:11] (PS1) Samwilson: jquery.wikiEditor.js: disable resizing bar on proofread-page [extensions/WikiEditor] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236862 (https://phabricator.wikimedia.org/T393231)
[00:43:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T415786)', diff saved to https://phabricator.wikimedia.org/P88676 and previous config saved to /var/cache/conftool/dbconfig/20260205-004353-marostegui.json
[00:43:57] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:51:29] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[00:51:34] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[00:53:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236861 (owner: TrainBranchBot)
[00:55:31] (CR) TrainBranchBot: [C:+2] "Approved by samwilson@deploy2002 using scap backport" [extensions/WikiEditor] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236862 (https://phabricator.wikimedia.org/T393231) (owner: Samwilson)
[00:57:02] (Merged) jenkins-bot: jquery.wikiEditor.js: disable resizing bar on proofread-page [extensions/WikiEditor] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236862 (https://phabricator.wikimedia.org/T393231) (owner: Samwilson)
[00:57:56] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1236862|jquery.wikiEditor.js: disable resizing bar on proofread-page (T393231)]]
[00:57:59] T393231: Show bottom resize bar for all content models - https://phabricator.wikimedia.org/T393231
[00:59:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P88677 and previous config saved to /var/cache/conftool/dbconfig/20260205-005902-marostegui.json
[00:59:16] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[01:00:07] !log samwilson@deploy2002 samwilson: Backport for [[gerrit:1236862|jquery.wikiEditor.js: disable resizing bar on proofread-page (T393231)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:02:10] !log samwilson@deploy2002 samwilson: Continuing with sync
[01:06:17] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236862|jquery.wikiEditor.js: disable resizing bar on proofread-page (T393231)]] (duration: 08m 21s)
[01:06:20] T393231: Show bottom resize bar for all content models - https://phabricator.wikimedia.org/T393231
[01:10:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236863
[01:10:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236863 (owner: TrainBranchBot)
[01:14:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P88678 and previous config saved to /var/cache/conftool/dbconfig/20260205-011410-marostegui.json
[01:29:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T415786)', diff saved to https://phabricator.wikimedia.org/P88679 and previous config saved to /var/cache/conftool/dbconfig/20260205-012918-marostegui.json
[01:29:22] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[01:29:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[01:29:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T415786)', diff saved to https://phabricator.wikimedia.org/P88680 and previous config saved to /var/cache/conftool/dbconfig/20260205-012942-marostegui.json
[01:34:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236863 (owner: TrainBranchBot)
[01:52:36] (PS1) Bhsd: Revert "Support WikiEditor's resizing drag bar for Page editing" [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236865 (https://phabricator.wikimedia.org/T393231)
[01:58:01] (PS1) Bhsd: Revert "Support WikiEditor's resizing drag bar for Page editing" [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236866 (https://phabricator.wikimedia.org/T393231)
[01:58:27] (Abandoned) Bhsd: Revert "Support WikiEditor's resizing drag bar for Page editing" [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236865 (https://phabricator.wikimedia.org/T393231) (owner: Bhsd)
[02:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:23:12] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 22m 21s)
[02:30:44] (CR) Samwilson: [C:+1] "I'll deploy this now." [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236866 (https://phabricator.wikimedia.org/T393231) (owner: Bhsd)
[02:31:49] (CR) TrainBranchBot: [C:+2] "Approved by samwilson@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236866 (https://phabricator.wikimedia.org/T393231) (owner: Bhsd)
[02:33:01] (Merged) jenkins-bot: Revert "Support WikiEditor's resizing drag bar for Page editing" [extensions/ProofreadPage] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236866 (https://phabricator.wikimedia.org/T393231) (owner: Bhsd)
[02:33:34] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1236866|Revert "Support WikiEditor's resizing drag bar for Page editing" (T393231)]]
[02:33:37] T393231: Show bottom resize bar for all content models - https://phabricator.wikimedia.org/T393231
[02:35:46] !log samwilson@deploy2002 samwilson, bhsd: Backport for [[gerrit:1236866|Revert "Support WikiEditor's resizing drag bar for Page editing" (T393231)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:36:47] !log samwilson@deploy2002 samwilson, bhsd: Continuing with sync
[02:40:54] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236866|Revert "Support WikiEditor's resizing drag bar for Page editing" (T393231)]] (duration: 07m 20s)
[02:40:58] T393231: Show bottom resize bar for all content models - https://phabricator.wikimedia.org/T393231
[02:58:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T415786)', diff saved to https://phabricator.wikimedia.org/P88681 and previous config saved to /var/cache/conftool/dbconfig/20260205-025845-marostegui.json
[02:58:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:13:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P88682 and previous config saved to /var/cache/conftool/dbconfig/20260205-031354-marostegui.json
[03:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[03:29:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P88683 and previous config saved to /var/cache/conftool/dbconfig/20260205-032902-marostegui.json
[03:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:44:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T415786)', diff saved to https://phabricator.wikimedia.org/P88684 and previous config saved to /var/cache/conftool/dbconfig/20260205-034410-marostegui.json
[03:44:14] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:44:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2221.codfw.wmnet with reason: Maintenance
[03:44:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T415786)', diff saved to https://phabricator.wikimedia.org/P88685 and previous config saved to /var/cache/conftool/dbconfig/20260205-034435-marostegui.json
[04:03:06] (PS1) Zabe: Start reading from file table on testwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1236870 (https://phabricator.wikimedia.org/T416548)
[04:59:16] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[05:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T415786)', diff saved to https://phabricator.wikimedia.org/P88686 and previous config saved to /var/cache/conftool/dbconfig/20260205-051441-marostegui.json
[05:14:44] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:29:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P88687 and previous config saved to /var/cache/conftool/dbconfig/20260205-052949-marostegui.json
[05:34:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P88688 and previous config saved to /var/cache/conftool/dbconfig/20260205-054457-marostegui.json
[06:00:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T415786)', diff saved to https://phabricator.wikimedia.org/P88689 and previous config saved to /var/cache/conftool/dbconfig/20260205-060006-marostegui.json
[06:00:10] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[06:00:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2222.codfw.wmnet with reason: Maintenance
[06:00:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T415786)', diff saved to https://phabricator.wikimedia.org/P88690 and previous config saved to /var/cache/conftool/dbconfig/20260205-060031-marostegui.json
[06:22:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2209 with weight 0 T416299', diff saved to https://phabricator.wikimedia.org/P88691 and previous config saved to /var/cache/conftool/dbconfig/20260205-062215-marostegui.json
[06:22:19] T416299: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T416299
[06:22:29] (CR) Marostegui: [C:+2] mariadb: Promote db2209 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1236107 (https://phabricator.wikimedia.org/T416299) (owner: Gerrit maintenance bot)
[06:22:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T416299
[06:23:12] !log Starting s3 codfw failover from db2205 to db2209 - T416299
[06:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s3 codfw as read-only for maintenance - T416299', diff saved to https://phabricator.wikimedia.org/P88692 and previous config saved to /var/cache/conftool/dbconfig/20260205-062557-marostegui.json
[06:26:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2209 to s3 primary and set section read-write T416299', diff saved to https://phabricator.wikimedia.org/P88693 and previous config saved to /var/cache/conftool/dbconfig/20260205-062617-marostegui.json
[06:26:39] !log marostegui@dns1006 START - running authdns-update
[06:27:11] (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1236108 (https://phabricator.wikimedia.org/T416299) (owner: Gerrit maintenance bot)
[06:27:26] (Abandoned) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1236108 (https://phabricator.wikimedia.org/T416299) (owner: Gerrit maintenance bot)
[06:27:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2205 T416299', diff saved to https://phabricator.wikimedia.org/P88694 and previous config saved to /var/cache/conftool/dbconfig/20260205-062737-marostegui.json
[06:27:40] !log marostegui@dns1006 END - running authdns-update
[06:27:41] T416299: Switchover s3 master (db2205 -> db2209) - https://phabricator.wikimedia.org/T416299
[06:29:06] (PS1) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1237053 (https://phabricator.wikimedia.org/T416299)
[06:31:06] (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1237053 (https://phabricator.wikimedia.org/T416299) (owner: Marostegui)
[06:31:09] !log marostegui@dns1006 START - running authdns-update
[06:32:12] (PS1) Marostegui: db2205: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1237054
[06:32:15] !log marostegui@dns1006 END - running authdns-update
[06:32:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2205.codfw.wmnet with reason: Schema change
[06:32:46] (CR) Marostegui: [C:+2] db2205: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1237054 (owner: Marostegui)
[06:33:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0700)
[07:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0700)
[07:12:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1236843 (owner: Bartosz Dziewoński)
[07:12:47] (PS1) Alexandros Kosiaris: offboarding akosiaris [puppet] - https://gerrit.wikimedia.org/r/1237055
[07:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:14:45] (CR) CI reject: [V:-1] offboarding akosiaris [puppet] - https://gerrit.wikimedia.org/r/1237055 (owner: Alexandros Kosiaris)
[07:15:10] Even CI doesn't want you to leave akosiaris
[07:15:37] (CR) Alexandros Kosiaris: [C:-1] "To be merged on 2026-02-13 (my last day). Adding Daniel (my onboarding buddy back in 2013), so that he merges it as my "offboarding buddy"" [puppet] - https://gerrit.wikimedia.org/r/1237055 (owner: Alexandros Kosiaris)
[07:16:14] marostegui: even CI...
[07:18:26] (PS2) Alexandros Kosiaris: offboarding akosiaris [puppet] - https://gerrit.wikimedia.org/r/1237055
[07:20:48] (CR) Alexandros Kosiaris: [C:-1] "CI fixed, comment stands" [puppet] - https://gerrit.wikimedia.org/r/1237055 (owner: Alexandros Kosiaris)
[07:27:58] SRE, LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11586056 (MoritzMuehlenhoff) Open→Resolved Access was enabled via Wikimedia IDM.
[07:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:30:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T415786)', diff saved to https://phabricator.wikimedia.org/P88695 and previous config saved to /var/cache/conftool/dbconfig/20260205-073011-marostegui.json
[07:30:15] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[07:32:18] (PS1) Muehlenhoff: Remove bd808 from puppetised config in favour of LDAP group [puppet] - https://gerrit.wikimedia.org/r/1237058
[07:36:23] !log installing openjdk-25 security updates
[07:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:49] (PS1) Kevin Bazira: ml: add vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1237060 (https://phabricator.wikimedia.org/T415627)
[07:42:55] !log installing openjdk-21 security updates
[07:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P88696 and previous config saved to /var/cache/conftool/dbconfig/20260205-074519-marostegui.json
[07:53:58] (PS6) Clément Goubert: Cleanup redundant lint-related rest gateway routing config [puppet] - https://gerrit.wikimedia.org/r/1210631 (owner: Aaron Schulz)
[07:56:00] (CR) Slyngshede: [C:+1] Remove bd808 from puppetised config in favour of LDAP group [puppet] - https://gerrit.wikimedia.org/r/1237058 (owner: Muehlenhoff)
[07:57:27] jouncebot: now
[07:57:27] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0700)
[07:57:30] jouncebot: nowandnext
[07:57:30] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0700)
[07:57:30] In 0 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0800)
[07:59:28] (PS1) Muehlenhoff: Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1237127
[08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0800)
[08:00:05] Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] o/
[08:00:22] I'm a deployer, I'll do it myself
[08:00:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P88697 and previous config saved to /var/cache/conftool/dbconfig/20260205-080027-marostegui.json
[08:00:37] excellent! Welcome aboard :-]
[08:01:07] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1236843 (owner: Bartosz Dziewoński)
[08:01:56] (Merged) jenkins-bot: Remove unused 'editor' right from plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1236843 (owner: Bartosz Dziewoński)
[08:02:42] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1236843|Remove unused 'editor' right from plwiki]]
[08:05:26] !log mszwarc@deploy2002 matmarex, mszwarc: Backport for [[gerrit:1236843|Remove unused 'editor' right from plwiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:07:04] !log mszwarc@deploy2002 matmarex, mszwarc: Continuing with sync
[08:07:31] (PS1) Muehlenhoff: Make bast1004 a bastion [puppet] - https://gerrit.wikimedia.org/r/1237128
[08:11:16] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236843|Remove unused 'editor' right from plwiki]] (duration: 08m 33s)
[08:11:37] I'm done. Is there anyone else who wants to deploy?
[08:12:27] If not, let's call it a day
[08:12:39] !log Morning backport window finished
[08:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T415786)', diff saved to https://phabricator.wikimedia.org/P88698 and previous config saved to /var/cache/conftool/dbconfig/20260205-081536-marostegui.json
[08:15:40] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[08:34:23] (CR) Slyngshede: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1237127 (owner: Muehlenhoff)
[08:36:39] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11586139 (lerickson) Hi! Sorry for the slow response. I'm requesting analytics-privatedata-users access level 3. I'm a Wikidata Platform SWE and will need HDFS access and t...
[08:41:14] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11586151 (cmooney) @VRiley I think there is a slight mix-up with [[ https://netbox.wikimedia.org/dcim/devices/6608/ | frqueue1005 ]] and [[ https://ne...
[08:43:33] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11586152 (jcrespo)
[08:49:20] (CR) Arnaudb: [C:+1] "also checked the dashboard pane and the runbooks, everything looks good to me!" [alerts] - https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: Jelto)
[08:52:36] (CR) Elukey: [C:+1] Make bast1004 a bastion [puppet] - https://gerrit.wikimedia.org/r/1237128 (owner: Muehlenhoff)
[08:55:28] (PS1) Marostegui: Revert "db2205: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1237135
[08:56:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2205 gradually with 4 steps - After schema change
[08:56:09] (CR) Marostegui: [C:+2] Revert "db2205: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1237135 (owner: Marostegui)
[08:57:09] (CR) Muehlenhoff: [C:+2] Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1237127 (owner: Muehlenhoff)
[08:57:15] !log jmm@dns1004 START - running authdns-update
[08:58:22] !log jmm@dns1004 END - running authdns-update
[08:59:02] (PS1) Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1237136 (https://phabricator.wikimedia.org/T416554)
[08:59:08] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1237137 (https://phabricator.wikimedia.org/T416554)
[08:59:16] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[08:59:42] (PS1) Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1237138 (https://phabricator.wikimedia.org/T416555)
[08:59:47] (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1237139 (https://phabricator.wikimedia.org/T416555)
[09:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T0900)
[09:01:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1163 with weight 0 T416480', diff saved to https://phabricator.wikimedia.org/P88701 and previous config saved to /var/cache/conftool/dbconfig/20260205-090145-marostegui.json
[09:01:49] T416480: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T416480
[09:01:55] (CR) Marostegui: [C:+2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1236753 (https://phabricator.wikimedia.org/T416480) (owner: Gerrit maintenance bot)
[09:02:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T416480
[09:02:44] !log Starting s1 eqiad failover from db1184 to db1163 - T416480
[09:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:23] !log update hosts running routed Ganeti to dnsmasq 2.92-1~wmf12u1 T396864
[09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:26] T396864: Routed Ganeti: same node DHCP limitation - https://phabricator.wikimedia.org/T396864
[09:04:29] (PS1) Marostegui: db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1237140
[09:05:47] (CR) Marostegui: [C:+2] db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1237140 (owner: Marostegui)
[09:05:51] (CR) Fabfur: "vtc tests ok" [puppet] - https://gerrit.wikimedia.org/r/1236703 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[09:05:53] (CR) Fabfur: [C:+2] cache::upload: enable global ratelimiting for bot (codfw) [puppet] - https://gerrit.wikimedia.org/r/1236703 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[09:06:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1163 to s1 primary T416480', diff saved to https://phabricator.wikimedia.org/P88702 and previous config saved to /var/cache/conftool/dbconfig/20260205-090623-marostegui.json
[09:07:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1184 T416480', diff saved to https://phabricator.wikimedia.org/P88703 and previous config saved to /var/cache/conftool/dbconfig/20260205-090702-marostegui.json
[09:07:06] T416480: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T416480
[09:07:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Schema change
[09:09:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[09:16:38] (PS1) Gehel: feat(WDQS)!: disable LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696)
[09:17:08] (CR) CI reject: [V:-1] feat(WDQS)!: disable LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696) (owner: Gehel)
[09:18:18] (PS2) Gehel: feat(WDQS)!: disable LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696)
[09:18:39] hashar: Is the train unblocked? Are you rolling out -wmf.14?
[09:21:04] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [09:21:33] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host sretest1002 [09:23:11] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:25:07] (03CR) 10Muehlenhoff: [C:03+2] Remove bd808 from puppetised config in favour of LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1237058 (owner: 10Muehlenhoff) [09:27:15] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest1002 - ayounsi@cumin1003" [09:27:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest1002 - ayounsi@cumin1003" [09:27:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:27:20] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1002.eqiad.wmnet 139.48.64.10.in-addr.arpa 9.3.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:27:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1002.eqiad.wmnet 139.48.64.10.in-addr.arpa 9.3.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:27:24] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1002 [09:28:19] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [09:28:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1002 [09:28:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host sretest1002 [09:31:15] phuedx: sorry I got lost into something :] [09:31:33] it was hold on a revert in 
mediawiki/vendor [09:32:11] that was done by Reedy overnight so I guess my patch and Zabe backport were fine :] [09:32:13] I'll do it [09:32:21] phuedx: did you have something to push? [09:32:29] I don't mind rolling it as part of upgrading group1 [09:32:44] (03PS1) 10Gehel: cleanup(WDQS/traffic): cleanup backend.yaml rules for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237145 (https://phabricator.wikimedia.org/T415696) [09:32:47] (03PS1) 10Gehel: cleanup(WDQS): remove monitoring for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237146 (https://phabricator.wikimedia.org/T415696) [09:32:49] (03PS1) 10Gehel: cleanup(WDQS): remove WDQS LDF endpoint from cfssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/1237147 (https://phabricator.wikimedia.org/T415696) [09:32:51] (03PS1) 10Gehel: cleanup(WDQS): remove all remaining references to the WDQS LDF endpoint. [puppet] - 10https://gerrit.wikimedia.org/r/1237148 (https://phabricator.wikimedia.org/T415696) [09:34:08] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#11586349 (10elukey) I think it will take ~2 weeks to delete tegola-swift-codfw-v002 and other ~2 weeks for the eqiad va... [09:37:03] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [09:37:19] (03PS1) 10Elukey: profile::docker_registry: change the ml's registry Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1237149 [09:37:23] hashar: Nothing to push. Code rolling out on -wmf.14 that I'm watching [09:37:30] great! 
[09:37:35] I am rolling the train [09:37:53] thank you so much for being around to watch the aftermath of some code deployment :] [09:38:07] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237150 (https://phabricator.wikimedia.org/T413805) [09:38:09] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237150 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:39:14] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237150 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:39:42] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [09:40:30] (03PS2) 10Elukey: profile::docker_registry: change the ml's registry Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1237149 [09:40:41] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237149 (owner: 10Elukey) [09:41:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2205 gradually with 4 steps - After schema change [09:44:25] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [09:44:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11586372 (10Gehel) [09:44:49] 06SRE, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Request: wdqs shell access for user lerickson - https://phabricator.wikimedia.org/T415373#11586371 (10Gehel) [09:45:06] 06SRE, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Request: wdqs shell access for user lerickson - https://phabricator.wikimedia.org/T415373#11586374 (10Gehel) This task is waiting on shell access 
to be completed in T415406 [09:45:13] (03CR) 10Dpogorzelski: [C:03+1] ml: add vLLM 0.14 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1237060 (https://phabricator.wikimedia.org/T415627) (owner: 10Kevin Bazira) [09:45:22] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.14 refs T413805 [09:45:25] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [09:45:37] phuedx: it is live on group1! [09:45:37] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [09:46:21] hashar: Nice. I'm watching logs [09:46:24] (03PS1) 10Ayounsi: reimage: use the freshest IP fpr DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [09:46:36] Are you going to let it settle and then roll on to group2? [09:46:39] (03PS2) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [09:47:41] most probably yes [09:48:18] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [09:48:23] iirc yesterday the only source of errors was for CentralAuth > OAuth > lcobucci/jwt incompatible upgrade [09:48:30] I'll review the process [09:48:54] it is probably fine to promote rapidly [09:49:16] without waiting for the next window (at something like 19:00 UTC) [09:49:41] Great [09:49:42] !log ammarpad@deploy2002 mwscript-k8s job started: refreshImageMetadata.php --wiki=commonswiki --mediatype=AUDIO --mime=application/ogg '--metadata-contains=Stream Undecodable' --force # T414348 [09:49:45] T414348: Some ogg vorbis files fail transcode silently and have duration of 0 - https://phabricator.wikimedia.org/T414348 [09:50:34] + usually trains do not cause much issues anymore [09:51:27] (03CR) 10CI reject: [V:04-1] reimage: use 
the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [09:52:06] I'm sure I can find a way to make the train more exciting for the train conductors :P [09:53:56] the exciting part will be when devs start conducting the MW deployment by themselves [09:54:10] and we might not be that far from reaching that point! [09:54:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236704 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:56:08] (03CR) 10Elukey: [C:03+2] profile::docker_registry: change the ml's registry Redis database [puppet] - 10https://gerrit.wikimedia.org/r/1237149 (owner: 10Elukey) [09:57:02] (03CR) 10Fabfur: "vtc tests are ok" [puppet] - 10https://gerrit.wikimedia.org/r/1236704 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:57:26] (03PS1) 10Fabfur: Revert "cache::upload: enable global ratelimiting for bot (codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/1237154 [09:58:01] (03CR) 10Vgutierrez: [C:03+1] Revert "cache::upload: enable global ratelimiting for bot (codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/1237154 (owner: 10Fabfur) [09:58:02] (03CR) 10Fabfur: [C:03+2] Revert "cache::upload: enable global ratelimiting for bot (codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/1237154 (owner: 10Fabfur) [10:00:07] (03CR) 10Jelto: [C:03+2] gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [10:00:07] (03PS1) 10Tiziano Fogli: thanos/compact: revert concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1237156 (https://phabricator.wikimedia.org/T410152) [10:01:12] (03CR) 10Tiziano Fogli: [C:03+2] "I’m self-merging since this is a well-known procedure." 
[puppet] - 10https://gerrit.wikimedia.org/r/1237156 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [10:01:44] (03Merged) 10jenkins-bot: gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [10:05:31] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11586492 (10elukey) [10:13:18] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [10:13:26] well it is all quiet [10:13:43] (03PS1) 10Fabfur: Revert^2 "cache::upload: enable global ratelimiting for bot (codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/1237163 [10:13:48] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [10:13:53] I am going to check a bunch of other metrics and will promote the rest of the wikis [10:14:39] (03CR) 10Fabfur: [C:03+2] Revert^2 "cache::upload: enable global ratelimiting for bot (codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/1237163 (owner: 10Fabfur) [10:15:28] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11586516 (10elukey) @hashar Hi! Would you be available Mon/Tue next week, during the MW Infrastructure wind... [10:17:50] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11586519 (10elukey) @CDanis Hi! I saw your name for otelcol and this is why I am reaching out :) IIUC it is a golang binary so it shou... 
[10:18:44] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [10:19:15] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [10:23:37] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1002.eqiad.wmnet with OS bookworm [10:24:00] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [10:27:57] (03PS1) 10Majavah: idm: Remove taavi from hardcoded list of account managers [puppet] - 10https://gerrit.wikimedia.org/r/1237165 [10:28:48] I am promoting the rest of the wikis [10:29:53] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237166 (https://phabricator.wikimedia.org/T413805) [10:29:56] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237166 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:30:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1237165 (owner: 10Majavah) [10:30:58] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237166 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:31:01] (03CR) 10Majavah: [C:03+2] idm: Remove taavi from hardcoded list of account managers [puppet] - 10https://gerrit.wikimedia.org/r/1237165 (owner: 10Majavah) [10:32:01] 06SRE, 06Privacy Engineering, 06Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409#11586571 (10Aklapper) a:05ssingh→03None @ssingh: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on... 
[10:35:53] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v.1.2.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237168 [10:36:16] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v.1.2.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237169 [10:36:55] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.14 refs T413805 [10:36:58] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [10:39:27] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v.1.2.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237168 (owner: 10Santiago Faci) [10:39:37] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v.1.2.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237169 (owner: 10Santiago Faci) [10:41:20] (03PS1) 10Santiago Faci: readingListAB.js: Updated to use mw.testKitchen [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237170 (https://phabricator.wikimedia.org/T414435) [10:41:50] (03Abandoned) 10Santiago Faci: Renaming `MetricsPlatform` => `TestKitchen` [extensions/ReadingLists] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236854 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [10:42:01] (03PS1) 10Santiago Faci: Renaming `MetricsPlatform` => `TestKitchen` [extensions/ReadingLists] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237171 (https://phabricator.wikimedia.org/T414435) [10:42:52] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [10:43:59] (03PS1) 10Majavah: hieradata: cloud: Add IPv6 addresses for proxies to Cumin ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1237173 [10:44:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Q3:rack/setup/install dse-k8s-worker10[20-23] - 
https://phabricator.wikimedia.org/T414216#11586631 (10BTullis) [10:46:36] (03PS1) 10Btullis: Add dse-k8s-worker1023 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237176 (https://phabricator.wikimedia.org/T414216) [10:46:38] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237170 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [10:47:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReadingLists] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237171 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [10:48:45] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [10:48:55] !log upgrade cloudcumin1001 to bookworm T403153 [10:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:58] T403153: Upgrade cloudcumin hosts to bookworm - https://phabricator.wikimedia.org/T403153 [10:51:22] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v.1.2.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237168 (owner: 10Santiago Faci) [10:51:22] (03CR) 10Btullis: [C:03+2] topic: New Flink application [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:51:39] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [10:52:24] (03CR) 10Vgutierrez: [C:03+1] cache::upload: enable global ratelimiting for bot (drmrs) [puppet] - 
10https://gerrit.wikimedia.org/r/1236704 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:53:10] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v.1.2.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237168 (owner: 10Santiago Faci) [10:53:28] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1236704 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:54:15] (03CR) 10Btullis: [C:03+2] topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236302 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:54:47] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [10:56:38] there is an error for CampaignEvents which I have filed as T416569, I don't think it is much of a problem [10:56:39] T416569: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cec_user_name__str' in 'WHERE'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Extension\CampaignEvents\Pager\EventContributionsPager)Query: SELECT cec - https://phabricator.wikimedia.org/T416569 [10:56:52] hashar: checking [10:57:11] ahah [10:57:27] so great to see our DBA step in as soon as "Unknown column" is mentioned! [10:57:33] it is an alias apparently rather than an actual missing column [10:57:39] yeah [10:57:41] I just saw :) [10:57:52] (03CR) 10Gehel: [C:03+1] Add dse-k8s-worker1023 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237176 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [10:57:58] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [10:58:22] I guess the query is malformed and miss the alias somehow. 
I imagine devs will figure it out [10:58:25] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host sretest1005 [10:58:28] thank you marostegui ! [10:58:29] :) [10:58:44] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [10:58:48] (03CR) 10Mvolz: [C:03+2] "Thanks for monitoring this and linking the readiness probe dashboard! I'll definitely watch that for future deploys." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236710 (owner: 10Mvolz) [10:58:59] I am off for lunch, I got my phone nearby if needed [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1100) [11:00:30] (03PS1) 10Marostegui: Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1237180 [11:00:58] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [11:01:08] (03CR) 10Marostegui: [C:03+2] Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1237180 (owner: 10Marostegui) [11:01:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1184: After schema change [11:01:17] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [11:01:51] (03CR) 10FNegri: [C:03+1] hieradata: cloud: Add IPv6 addresses for proxies to Cumin ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1237173 (owner: 10Majavah) [11:02:16] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest1005 - ayounsi@cumin1003" [11:02:18] (03CR) 10Majavah: [C:03+2] hieradata: cloud: Add IPv6 addresses for proxies to Cumin ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1237173 (owner: 10Majavah) [11:02:21] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host sretest1005 - ayounsi@cumin1003" [11:02:21] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:02:21] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1005.eqiad.wmnet 130.32.64.10.in-addr.arpa 0.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:02:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1005.eqiad.wmnet 130.32.64.10.in-addr.arpa 0.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:02:26] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [11:02:28] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host sretest1005 [11:02:38] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:06:03] (03CR) 10Effie Mouzeli: [C:04-1] "Any errors in this code could potentially break production mw-mcrouter and cause an outage. 
I suggest we discuss on the task what would wo" [puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [11:06:07] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rollback records for host sretest1005 - ayounsi@cumin1003" [11:06:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rollback records for host sretest1005 - ayounsi@cumin1003" [11:06:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:12] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1005.eqiad.wmnet 3.141.64.10.in-addr.arpa 3.0.0.0.1.4.1.0.4.6.0.0.0.1.0.0.3.1.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:06:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1005.eqiad.wmnet 3.141.64.10.in-addr.arpa 3.0.0.0.1.4.1.0.4.6.0.0.0.1.0.0.3.1.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:06:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host sretest1005 [11:06:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1005.eqiad.wmnet with OS bookworm [11:06:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11586748 (10Gehel) p:05Triage→03High [11:06:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [11:07:42] (03CR) 10Btullis: topic: Flink enrichment pipeline (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [11:08:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
cloudcumin1001.eqiad.wmnet [11:08:20] (03PS3) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [11:09:17] (03CR) 10Effie Mouzeli: "if this is not valid any more, shall we abandon it?" [puppet] - 10https://gerrit.wikimedia.org/r/1188365 (owner: 10Fabfur) [11:10:07] (03Abandoned) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [11:11:27] (03Abandoned) 10Fabfur: Revert "hiera: remove unneeded option for hcaptcha service" [puppet] - 10https://gerrit.wikimedia.org/r/1188365 (owner: 10Fabfur) [11:11:28] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [11:11:43] (03CR) 10Fabfur: "yes thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1188365 (owner: 10Fabfur) [11:12:11] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11586789 (10taavi) [11:12:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [11:12:23] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11586794 (10taavi) [11:12:28] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11586796 (10taavi) [11:14:16] (03PS4) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [11:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [11:14:50] (03PS2) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236302 (https://phabricator.wikimedia.org/T360794) [11:15:00] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1005.eqiad.wmnet with OS bookworm [11:15:17] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host sretest1005 [11:16:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:20:00] (03PS1) 10Majavah: Add dumps-http.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1237187 (https://phabricator.wikimedia.org/T306550) [11:20:08] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest1005 - ayounsi@cumin1003" [11:20:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest1005 - ayounsi@cumin1003" [11:20:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:20:13] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache sretest1005.eqiad.wmnet 130.32.64.10.in-addr.arpa 0.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:20:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1005.eqiad.wmnet 130.32.64.10.in-addr.arpa 0.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:20:17] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [11:21:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1005 [11:21:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host sretest1005 [11:24:08] (03CR) 10FNegri: 
wmcs: fix infra-tracing-nfs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1231034 (https://phabricator.wikimedia.org/T415199) (owner: 10Volans) [11:27:03] (03CR) 10Alexandros Kosiaris: "A few years later, the diff is substantially smaller. Only 4 cluster this time around, the diff for those seems pretty manageable. The big" [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:27:13] (03PS7) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [11:27:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236705 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:30:41] (03CR) 10Fabfur: "vtc tests ok" [puppet] - 10https://gerrit.wikimedia.org/r/1236705 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:31:28] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:31:40] (03PS2) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) [11:33:54] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [11:34:48] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [11:34:52] (03PS1) 10Majavah: 
dumps: web: Remove 2022 block due to bandwidth saturation [puppet] - 10https://gerrit.wikimedia.org/r/1237189 (https://phabricator.wikimedia.org/T317001) [11:39:33] (03CR) 10Ayounsi: "Tested with test-cookbook, works as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [11:39:53] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [11:40:33] (03CR) 10Btullis: [V:03+2 C:03+2] topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236302 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [11:42:57] (03CR) 10FNegri: [C:03+1] dumps: web: Remove 2022 block due to bandwidth saturation [puppet] - 10https://gerrit.wikimedia.org/r/1237189 (https://phabricator.wikimedia.org/T317001) (owner: 10Majavah) [11:43:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [11:45:30] (03CR) 10Majavah: [C:03+2] dumps: web: Remove 2022 block due to bandwidth saturation [puppet] - 10https://gerrit.wikimedia.org/r/1237189 (https://phabricator.wikimedia.org/T317001) (owner: 10Majavah) [11:46:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db1184: After schema change [11:47:47] (03CR) 10JavierMonton: topic: Flink enrichment pipeline (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [11:48:15] (03Merged) 10jenkins-bot: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236302 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [11:49:03] (03PS1) 10Phuedx: ext.wikimediaEvents: Add code for synth-aaa-test-mw-js experiment code [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237190 
[11:49:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237190 (owner: 10Phuedx) [11:51:45] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v.1.2.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237169 (owner: 10Santiago Faci) [11:52:46] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Build OpenGear serial port config from Netbox - https://phabricator.wikimedia.org/T415345#11586961 (10ayounsi) Alright, it's live on the prod Netbox instance : https://netbox.wikimedia.org/dcim/devices/1955/render-config/ [11:53:28] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v.1.2.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237169 (owner: 10Santiago Faci) [11:53:38] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [11:55:02] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [11:55:27] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580 (10DPogorzelski-WMF) 03NEW [11:55:38] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [11:56:41] (03PS2) 10Muehlenhoff: sre.cdn.roll-restart-reboot-ncredir: Fix one more syntax error [cookbooks] - 10https://gerrit.wikimedia.org/r/1235814 [11:57:00] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11587005 (10DPogorzelski-WMF) to be noted that we already use kserve in the ML context installed via: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_n... 
[11:58:58] (03CR) 10Vgutierrez: [C:03+1] cache::upload: enable global ratelimiting for bot (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1236705 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [12:01:08] (03PS1) 10Majavah: dumps: web: Trust X-Client-IP from edge caches [puppet] - 10https://gerrit.wikimedia.org/r/1237193 (https://phabricator.wikimedia.org/T306550) [12:01:10] (03PS1) 10Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) [12:01:58] (03CR) 10CI reject: [V:04-1] hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [12:01:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS bookworm [12:02:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7985/co" [puppet] - 10https://gerrit.wikimedia.org/r/1237193 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [12:02:22] (03PS2) 10Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) [12:05:12] (03CR) 10Vgutierrez: [C:03+1] "hmm this looks like a common mistake across traffic cookbooks, thx!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1235814 (owner: 10Muehlenhoff) [12:11:18] (03CR) 10Btullis: [C:03+1] "Looks good, although we will have to pre-create the S3/Swift user before it will work." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:12:53] (03CR) 10Muehlenhoff: [C:03+2] sre.cdn.roll-restart-reboot-ncredir: Fix one more syntax error [cookbooks] - 10https://gerrit.wikimedia.org/r/1235814 (owner: 10Muehlenhoff) [12:14:34] (03CR) 10Btullis: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:15:24] (03CR) 10Btullis: topic: Flink enrichment pipeline (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:15:50] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236333 (owner: 10Muehlenhoff) [12:17:54] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [12:18:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:18:58] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [12:19:07] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1237200 (owner: 10L10n-bot) [12:21:47] (03CR) 10Btullis: topic: Flink enrichment pipeline (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:23:02] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [12:23:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:24:04] (03PS3) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) [12:24:12] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [12:26:13] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [12:27:22] (03CR) 10Btullis: topic: Flink enrichment pipeline (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:27:30] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:27:42] (03CR) 10JavierMonton: topic: Flink enrichment pipeline (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:28:05] (03PS1) 10Majavah: openstack: Fix puppetleaks script for openstack authentication changes [puppet] - 10https://gerrit.wikimedia.org/r/1237208 [12:29:02] (03PS4) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) [12:29:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:30:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:33:55] (03CR) 10Btullis: [C:03+1] topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:38:15] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:40:37] (03CR) 10Kamila Součková: [C:03+1] wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: 10Jasmine) [12:48:51] (03PS2) 10Kamila Součková: shellbox-video: Revert upsize now that backlog has cleared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229161 (owner: 10Scott French) [12:48:51] (03CR) 10Kamila Součková: [C:03+1] "good to go now, thanks :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229161 (owner: 10Scott French) [12:49:39] (03PS1) 10Majavah: hieradata: tlsproxy::envoy: Default to listening on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1237215 (https://phabricator.wikimedia.org/T255568) [12:53:07] (03CR) 10Cathal Mooney: [C:03+1] "Nice work!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [12:56:01] (03CR) 10Kamila Součková: [C:03+1] wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: 10Jasmine) [12:56:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2204 gradually with 4 steps - After schema change [12:57:50] 06SRE, 10Prod-Kubernetes, 06serviceops: Kubernetes apiserver probe failures on restart - https://phabricator.wikimedia.org/T358936#11587287 (10akosiaris) 05Open→03Resolved a:03akosiaris Close to 2 years later, and with {T353464} done, I don't think we 've seen a recurrence. I 'll boldly resolve [12:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [12:59:44] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11587291 (10CDanis) >>! In T416452#11586518, @elukey wrote: > @CDanis Hi! I saw your name for otelcol and this is why I am reaching ou... 
[12:59:59] (03CR) 10Filippo Giunchedi: [C:03+1] Add dumps-http.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1237187 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1300) [13:00:30] (03CR) 10Majavah: [C:03+2] Add dumps-http.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1237187 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [13:00:52] !log taavi@dns1004 START - running authdns-update [13:00:59] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Trust X-Client-IP from edge caches [puppet] - 10https://gerrit.wikimedia.org/r/1237193 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [13:02:01] (03CR) 10Majavah: [V:03+1 C:03+2] dumps: web: Trust X-Client-IP from edge caches [puppet] - 10https://gerrit.wikimedia.org/r/1237193 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [13:02:04] !log taavi@dns1004 END - running authdns-update [13:03:30] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:10:18] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1236705 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:12:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11587357 (10VRiley-WMF) [13:12:38] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 35 NOOP 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-nod" [puppet] - 10https://gerrit.wikimedia.org/r/1237215 (https://phabricator.wikimedia.org/T255568) (owner: 
10Majavah) [13:13:39] (03CR) 10Fabfur: "vtc tests ok" [puppet] - 10https://gerrit.wikimedia.org/r/1236706 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:25:47] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11587410 (10Jgreen) >>! In T416251#11584083, @Jhancock.wm wrote: > @Dwisehaupt or @Jgreen can i rack this in the new rack? or is this going in the og one at codfw? @Jhancock.wm this one goes in the sa... [13:29:31] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592 (10Nicholusmuwonge_wmde) 03NEW [13:32:32] (03PS1) 10Volans: admin: update volans's ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1237220 [13:32:57] (03PS2) 10Volans: admin: update volans's ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1237220 [13:34:03] (03CR) 10Volans: "I've gpg-signed the commit, but it might not show up in Gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1237220 (owner: 10Volans) [13:39:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [13:40:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [13:41:20] !log cmooney@cumin1003 START - Cookbook sre.network.provision for device fasw2-e16a-eqiad.mgmt.eqiad.wmnet [13:41:22] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:41:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2204 gradually with 4 steps - After schema change [13:42:21] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache fasw2-e16a-eqiad.mgmt.eqiad.wmnet on all recursors [13:42:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) fasw2-e16a-eqiad.mgmt.eqiad.wmnet on all recursors [13:42:29] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache fasw2-e16b-eqiad.mgmt.eqiad.wmnet on all 
recursors [13:42:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) fasw2-e16b-eqiad.mgmt.eqiad.wmnet on all recursors [13:43:25] !log cmooney@cumin1003 START - Cookbook sre.network.provision for device fasw2-e16b-eqiad.mgmt.eqiad.wmnet [13:43:41] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:43:42] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:43:53] (03PS1) 10Brouberol: airflow: ensure the ssh privatekey is b64 encoded in the Secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237221 (https://phabricator.wikimedia.org/T402512) [13:46:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:40] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:46:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1237220 (owner: 10Volans) [13:47:41] (03CR) 10Volans: [C:03+2] admin: update volans's ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1237220 (owner: 10Volans) [13:48:10] (03CR) 10Btullis: [C:03+1] airflow: ensure the ssh privatekey is b64 encoded in the Secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237221 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [13:49:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:40] (03CR) 10Kamila Součková: "LGTM except for the question inline, I'll +1 once rebased." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [13:51:13] (03CR) 10Brouberol: [C:03+2] airflow: ensure the ssh privatekey is b64 encoded in the Secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237221 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [13:54:05] (03CR) 10Elukey: [C:03+1] airflow: ensure the ssh privatekey is b64 encoded in the Secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237221 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [13:54:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [13:55:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [13:56:13] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11587507 (10Nicholusmuwonge_wmde) [13:59:56] (03CR) 10Elukey: reimage: use the freshest IP for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1400). [14:00:05] phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:29] o [14:00:32] o/ [14:01:08] phuedx: do you want to deploy your change? 
[14:01:40] (03CR) 10Muehlenhoff: [C:03+2] hadoop: Drop OS check for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1223183 (owner: 10Muehlenhoff) [14:01:51] Sure [14:02:19] (03CR) 10Filippo Giunchedi: "Can't meaningfully comment/vote on the extra section in trafficserver/backend.yaml, LGTM overall though" [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [14:02:57] (03PS3) 10Muehlenhoff: Remove puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) [14:02:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237190 (owner: 10Phuedx) [14:04:57] (03CR) 10Volans: [C:03+2] "As agreed on IRC, merging to test it on toolsbeta NFS workers with puppet disabled on tools NFS workers." [puppet] - 10https://gerrit.wikimedia.org/r/1231034 (https://phabricator.wikimedia.org/T415199) (owner: 10Volans) [14:04:59] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add code for synth-aaa-test-mw-js experiment code [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237190 (owner: 10Phuedx) [14:05:19] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1237190|ext.wikimediaEvents: Add code for synth-aaa-test-mw-js experiment code]] [14:06:19] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11587538 (10Eevans) [14:07:22] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1237190|ext.wikimediaEvents: Add code for synth-aaa-test-mw-js experiment code]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:10:16] (03CR) 10Brouberol: [C:03+1] Add turnilo-next and turnilo to wmnet/wm.org [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) (owner: 10Joal) [14:10:44] (03CR) 10Brouberol: [C:03+1] "I'm going to +2 and apply as @joal@wikimedia.org does not have root access on the dns servers" [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) (owner: 10Joal) [14:10:48] (03CR) 10Brouberol: [C:03+2] Add turnilo-next and turnilo to wmnet/wm.org [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) (owner: 10Joal) [14:11:04] !log brouberol@dns1004 START - running authdns-update [14:11:23] Code is being delivered to the frontend. LGTM [14:12:13] !log brouberol@dns1004 END - running authdns-update [14:12:27] !log phuedx@deploy2002 phuedx: Continuing with sync [14:14:47] (03CR) 10Kamila Součková: [C:03+1] "Long-term it would be great to consider adding some proper tooling for service owners, I'll let that simmer in my head for a while." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 (owner: 10Daniel Kinzler) [14:16:34] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237190|ext.wikimediaEvents: Add code for synth-aaa-test-mw-js experiment code]] (duration: 11m 14s) [14:16:40] (03CR) 10Kamila Součková: [C:03+1] rediscope: lower cpu and memoy limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 (owner: 10Daniel Kinzler) [14:21:50] !log UTC afternoon backport+config window done [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:19] (03PS1) 10Muehlenhoff: os-reports: Initial bullseye updates [puppet] - 10https://gerrit.wikimedia.org/r/1237234 [14:24:00] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1236706 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:28:06] (03CR) 10Kamila Součková: [C:03+1] redioscope: enable time bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [14:31:18] (03CR) 10Kamila Součková: [C:03+1] redioscope: enable time bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [14:31:48] (03CR) 10JavierMonton: [C:03+2] topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [14:34:00] (03Merged) 10jenkins-bot: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [14:35:06] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11587651 (10WMDECyn) Approved from WMDE side [14:35:24] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Nicholusmuwonge - 
https://phabricator.wikimedia.org/T416494#11587662 (10WMDECyn) Approved from WMDE side [14:38:23] (03PS1) 10Bking: dse-k8s: add opensearch-semantic-search records [dns] - 10https://gerrit.wikimedia.org/r/1237236 (https://phabricator.wikimedia.org/T414703) [14:39:49] (03CR) 10Muehlenhoff: [C:03+2] os-reports: Initial bullseye updates [puppet] - 10https://gerrit.wikimedia.org/r/1237234 (owner: 10Muehlenhoff) [14:40:00] (03PS1) 10Muehlenhoff: Enable Bird 2.18 for cephosd/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1237238 (https://phabricator.wikimedia.org/T413740) [14:40:00] (03CR) 10Brouberol: [C:03+1] dse-k8s: add opensearch-semantic-search records [dns] - 10https://gerrit.wikimedia.org/r/1237236 (https://phabricator.wikimedia.org/T414703) (owner: 10Bking) [14:40:33] (03CR) 10Bking: [C:03+2] dse-k8s: add opensearch-semantic-search records [dns] - 10https://gerrit.wikimedia.org/r/1237236 (https://phabricator.wikimedia.org/T414703) (owner: 10Bking) [14:41:37] !log bking@dns1004 START - running authdns-update [14:42:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237238 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [14:42:48] !log bking@dns1004 END - running authdns-update [14:43:23] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) [14:43:25] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1237242 (https://phabricator.wikimedia.org/T406545) [14:43:26] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1237243 (https://phabricator.wikimedia.org/T406545) [14:43:28] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Nicholusmuwonge - https://phabricator.wikimedia.org/T416494#11587691 (10tappof) Hi @KFrancis, the user @Nicholusmuwonge_wmde needs a 
valid NDA, as he’s not listed in the spreadsheet. Thank you. [14:43:28] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1237244 (https://phabricator.wikimedia.org/T406545) [14:43:30] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1237245 (https://phabricator.wikimedia.org/T406545) [14:43:32] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1237246 (https://phabricator.wikimedia.org/T406545) [14:43:34] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1237247 (https://phabricator.wikimedia.org/T406545) [14:43:45] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) [14:43:56] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1237242 (https://phabricator.wikimedia.org/T406545) [14:44:07] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1237243 (https://phabricator.wikimedia.org/T406545) [14:44:16] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache opensearch-semantic-search.discovery.wmnet on all recursors [14:44:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) opensearch-semantic-search.discovery.wmnet on all recursors [14:46:16] (03PS1) 10Elukey: admin: add user lerickson to analytics-privatedata, wdqs-{root,admins} [puppet] - 10https://gerrit.wikimedia.org/r/1237248 (https://phabricator.wikimedia.org/T415406) [14:46:55] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11587738 (10taavi) [14:47:00] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task 
for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11587740 (10taavi) [14:47:02] (03CR) 10Vgutierrez: [C:03+1] "vtcs looking good" [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:47:18] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#11587743 (10taavi) [14:48:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:52:28] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw2-e16a-eqiad.mgmt.eqiad.wmnet [14:53:08] (03PS3) 10Ryan Kemper: opensearch-semantic-search-test: provision ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234593 (https://phabricator.wikimedia.org/T414702) [14:55:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device fasw2-e16b-eqiad.mgmt.eqiad.wmnet [14:55:42] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11587761 (10tappof) 05Open→03Invalid Since @BTullis and @Dzahn have shared the in... 
[14:56:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11587782 (10elukey) [14:56:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1237248 (https://phabricator.wikimedia.org/T415406) (owner: 10Elukey) [14:59:16] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Nicholusmuwonge - https://phabricator.wikimedia.org/T416494#11587811 (10Nicholusmuwonge_wmde) Hey @tappof & @KFrancis ,FYI I signed the NDA yesterday :) [15:00:58] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11587815 (10elukey) @Nicholusmuwonge_wmde Hi! Could you please sign the https://phabricator.wikimedia.org/L3 document? Hi @KFrancis, do we need an explicit ND... [15:01:26] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11587837 (10tappof) Hello @Nicholusmuwonge_wmde, While we’re waiting for the NDA to be signed, could you please read and sign https://phabricator.wikimedia.org... [15:02:43] (03CR) 10Bking: [C:03+2] opensearch-semantic-search-test: provision ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234593 (https://phabricator.wikimedia.org/T414702) (owner: 10Ryan Kemper) [15:03:32] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[15:03:46] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 54728200 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:03:56] (03CR) 10Ayounsi: reimage: use the freshest IP for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [15:04:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237238 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:04:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3299616 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:04:46] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:05:00] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [15:06:20] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:06:32] (03CR) 10Brouberol: [C:03+1] openjdk-21-jdk: source image from new openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235870 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [15:07:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:08:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11587860 (10tappof) Requested SSH key via Slack for an out-of-band check. 
[15:08:35] (03PS1) 10Volans: wmcs: infra-tracing-nfs bail out earlier if root [puppet] - 10https://gerrit.wikimedia.org/r/1237251 (https://phabricator.wikimedia.org/T415199) [15:09:03] (03CR) 10Bking: [C:03+2] opensearch-semantic-search-test: depl eqiad, codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234594 (https://phabricator.wikimedia.org/T414691) (owner: 10Ryan Kemper) [15:12:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:13:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [15:14:43] (03CR) 10Dzahn: "I had I756f780fbece3c6 but it's probably a straight duplicate." [puppet] - 10https://gerrit.wikimedia.org/r/1237055 (owner: 10Alexandros Kosiaris) [15:14:58] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Thu 05 Mar 2026 02:40:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [15:15:00] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:15:39] (03CR) 10Filippo Giunchedi: [C:03+1] wmcs: infra-tracing-nfs bail out earlier if root [puppet] - 10https://gerrit.wikimedia.org/r/1237251 (https://phabricator.wikimedia.org/T415199) (owner: 10Volans) [15:16:21] (03CR) 10Volans: [C:03+2] wmcs: infra-tracing-nfs bail out earlier if root [puppet] - 10https://gerrit.wikimedia.org/r/1237251 (https://phabricator.wikimedia.org/T415199) (owner: 10Volans) [15:16:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:17:17] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1237241 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:17:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:19:49] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [15:19:49] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster role from puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/1230331 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:21:45] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on fasw2-c1a-eqiad,fasw2-c1b-eqiad,pfw1-eqiad with reason: fundraising migration eqiad [15:21:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11587929 (10ops-monitoring-bot) Icinga downtime and Alertmanager 
silence (ID=5da72ec9-7626-47d2-bc98-a871f93d717e) set by cmooney@cumin1003 for 1 day, 0... [15:23:04] (03CR) 10Fabfur: "vtc tests ok" [puppet] - 10https://gerrit.wikimedia.org/r/1237242 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:23:55] (03Abandoned) 10Brouberol: druid_exporter: Fixup metric definition [puppet] - 10https://gerrit.wikimedia.org/r/1226766 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [15:25:00] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:25:57] !log deactivate BGP session from cr2-eqiad to pfw1b-eqiad fundraising migration T403035 [15:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:00] T403035: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035 [15:26:45] (03PS1) 10Dzahn: zookeeper: mTLS debugging, use TLSv.1.2, clientAuth=want, set alias to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1237253 (https://phabricator.wikimedia.org/T395938) [15:28:33] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2001.codfw.wmnet [15:28:56] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [15:29:01] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [15:29:05] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [15:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:19] !log javiermonton@deploy2002 
helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:29:20] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*.codfw.wmnet: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [15:29:24] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [15:29:33] (03PS2) 10Dzahn: zookeeper: mTLS debugging, use TLSv.1.2, clientAuth=want, set alias to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1237253 (https://phabricator.wikimedia.org/T395938) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1530) [15:32:43] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11587988 (10tappof) p:05Triage→03Medium [15:32:58] (03PS1) 10Ladsgroup: Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237255 (https://phabricator.wikimedia.org/T415282) [15:33:14] (03PS1) 10Ladsgroup: Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1237256 (https://phabricator.wikimedia.org/T415282) [15:33:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:33:42] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [15:34:48] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster2001 - https://phabricator.wikimedia.org/T416606 (10MoritzMuehlenhoff) 03NEW [15:35:36] (03CR) 10Dpogorzelski: [C:03+2] ml: add vLLM 0.14 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1237060 (https://phabricator.wikimedia.org/T415627) (owner: 10Kevin Bazira) [15:35:41] (03PS1) 10Fabfur: Revert "cache::upload: enable 
global ratelimiting (magru)" [puppet] - 10https://gerrit.wikimedia.org/r/1237257 [15:36:15] (03CR) 10Elukey: reimage: use the freshest IP for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [15:36:41] (03CR) 10Fabfur: [C:03+2] Revert "cache::upload: enable global ratelimiting (magru)" [puppet] - 10https://gerrit.wikimedia.org/r/1237257 (owner: 10Fabfur) [15:36:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:37:55] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-7/2/0 {#4027}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:40:05] jmm@cumin2002 decommission (PID 1645186) is awaiting input [15:40:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:40:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2001.codfw.wmnet [15:40:34] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11588030 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster2001.codfw.wmnet` - puppetmaster2001.... 
[15:41:21] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster2001 - https://phabricator.wikimedia.org/T416606#11588035 (10MoritzMuehlenhoff) [15:43:48] (03PS1) 10Jelto: deployment_server: add linked-artifacts kubeconfig files [puppet] - 10https://gerrit.wikimedia.org/r/1237258 (https://phabricator.wikimedia.org/T414112) [15:45:47] (03PS5) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [15:46:12] (03CR) 10Ayounsi: reimage: use the freshest IP for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [15:46:24] (03CR) 10Jelto: "maybe let's separate the service catalog configuration and the Kubernetes service setup? I opened I1c092f52788b6b875ffae1936ef6dd36bc3747f" [puppet] - 10https://gerrit.wikimedia.org/r/1227851 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [15:47:01] (03PS6) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [15:47:11] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*.codfw.wmnet: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [15:47:15] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [15:51:25] (03CR) 10Dzahn: [C:03+2] zookeeper: mTLS debugging, use TLSv.1.2, clientAuth=want, set alias to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1237253 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [15:53:25] (03CR) 10CI reject: [V:04-1] reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [15:55:07] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host 
sretest1005.eqiad.wmnet with OS bookworm [15:55:53] (03CR) 10Alexandros Kosiaris: [C:03+2] "went through all the keys one by one, this is safe to land." [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [15:58:27] FIRING: GnmiTargetDown: fasw2-e16a-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:58:28] (03PS7) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [15:59:38] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*.eqiad.wmnet: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [15:59:41] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [16:00:05] hashar and brennen: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1600) [16:01:29] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:01:36] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:03:03] (03PS1) 10Hashar: gerrit: allow `replication` when in readonly mode [puppet] - 10https://gerrit.wikimedia.org/r/1237259 [16:03:56] (03CR) 10CI reject: [V:04-1] reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [16:04:54] (03CR) 10Hashar: "This should allow usage of the `replication list` command when the primary is in read-only mode. 
The config is trivial enough that we can " [puppet] - 10https://gerrit.wikimedia.org/r/1237259 (owner: 10Hashar) [16:05:11] (03PS8) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [16:06:58] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11588148 (10Nicholusmuwonge_wmde) [16:08:22] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [16:08:30] FIRING: LibericaUnhealthyRealserverPooled: ... [16:08:30] Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://grafana.wikimedia.org/d/d70d14db-4a71-414d-8425-7a30d7127ca6/liberica-services?orgId=1&var-site=drmrs&var-service=gerrit-sshlb_29418&var-instance=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:11:49] (03CR) 10Elukey: reimage: use the freshest IP for DHCP (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [16:12:05] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11588173 (10Nicholusmuwonge_wmde) >>! In T416592#11587836, @tappof wrote: > Hello @Nicholusmuwonge_wmde, > While we’re waiting for the NDA to be signed, could... 
[16:13:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs6001:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:13:49] uh? :) [16:14:00] mutante: is that you? [16:14:07] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [16:17:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [16:17:27] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*.eqiad.wmnet: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [16:17:30] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [16:17:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [16:19:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1005.eqiad.wmnet with reason: host reimage [16:23:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:23:10] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:23:14] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:23:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:23:32] (03PS1) 10Alexandros Kosiaris: Revert "base::sysctl: Switch priority of the ubuntu-defaults stanza" 
[puppet] - 10https://gerrit.wikimedia.org/r/1237262 [16:23:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:23:53] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Revert "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237262 (owner: 10Alexandros Kosiaris) [16:24:06] (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1237238 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:24:56] (03CR) 10Btullis: [C:03+2] Add dse-k8s-worker1023 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237176 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [16:25:51] jouncebot: nowandnext [16:25:51] For the next 0 hour(s) and 34 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1600) [16:25:51] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1700) [16:28:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs4008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:28:40] (03CR) 10Muehlenhoff: [C:03+2] Enable Bird 2.18 for cephosd/codfw [puppet] - 10https://gerrit.wikimedia.org/r/1237238 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:31:13] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11588282 (10Aklapper) The Phab frontpage displays W2984 listing systems. Ideally, onb... 
[16:33:30] FIRING: [10x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:36:01] (03PS1) 10Brouberol: growthbook: remove the proxy rules enabling internet access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237264 (https://phabricator.wikimedia.org/T416609) [16:36:03] (03PS9) 10Ayounsi: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) [16:37:06] !log deactivate BGP session from cr1-eqiad to pfw1a-eqiad fundraising migration T403035 [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:09] T403035: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035 [16:37:55] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-7/2/0 {#4027}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:38:16] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237264 (https://phabricator.wikimedia.org/T416609) (owner: 10Brouberol) [16:38:30] FIRING: [9x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:38:37] (03CR) 10Elukey: [C:03+1] "Really nice thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [16:38:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1005.eqiad.wmnet with OS bookworm [16:40:08] (03CR) 10Brouberol: [C:03+2] growthbook: remove the proxy rules enabling internet access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237264 (https://phabricator.wikimedia.org/T416609) (owner: 10Brouberol) [16:41:24] 10SRE-swift-storage, 06Commons: File disappeared from server - https://phabricator.wikimedia.org/T416617#11588369 (10Aklapper) @Yann: Please set #sre-swift-storage for missing files. [16:42:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:42:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:42:20] (03CR) 10Elukey: "Really nice one!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237264 (https://phabricator.wikimedia.org/T416609) (owner: 10Brouberol) [16:42:26] (03PS1) 10Vgutierrez: tcpproxy: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1237266 [16:42:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237266 (owner: 10Vgutierrez) [16:43:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [16:43:18] (03CR) 10Alexandros Kosiaris: [C:03+1] tcpproxy: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1237266 (owner: 10Vgutierrez) [16:43:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [16:43:27] (03CR) 10CDanis: [C:03+1] tcpproxy: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1237266 (owner: 10Vgutierrez) [16:43:51] (03CR) 10Dzahn: [C:03+1] tcpproxy: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1237266 
(owner: 10Vgutierrez) [16:43:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:44:10] (03PS1) 10Alexandros Kosiaris: Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 [16:44:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:44:40] (03CR) 10Vgutierrez: [C:03+2] tcpproxy: Disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1237266 (owner: 10Vgutierrez) [16:46:13] (03CR) 10CI reject: [V:04-1] Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 (owner: 10Alexandros Kosiaris) [16:46:46] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS bullseye [16:47:14] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host pki1002 [16:47:23] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [16:47:45] (03CR) 10Ladsgroup: [C:03+2] Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1237256 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [16:47:49] (03CR) 10Ladsgroup: [C:03+2] Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237255 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [16:48:30] FIRING: [9x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:48:51] (03Merged) 10jenkins-bot: Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.13) - 
10https://gerrit.wikimedia.org/r/1237256 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [16:49:38] (03Merged) 10jenkins-bot: Stop relying on ThumbRenderMap and use a standard size instead [extensions/MediaSearch] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237255 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [16:50:20] (03PS2) 10Alexandros Kosiaris: Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 [16:52:24] (03CR) 10CI reject: [V:04-1] Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 (owner: 10Alexandros Kosiaris) [16:53:10] ayounsi@cumin1003 reimage (PID 3264655) is awaiting input [16:53:18] (03PS7) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [16:53:21] (03PS3) 10Alexandros Kosiaris: Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 [16:53:26] 10SRE-swift-storage, 06Commons: File disappeared from server - https://phabricator.wikimedia.org/T416617#11588461 (10Yann) Actually there was a previous version, so the issue is not really important. https://upload.wikimedia.org/wikipedia/commons/7/7a/Canadian_National_Vimy_Memorial.JPG I guess there is a cach... 
[16:53:30] FIRING: [6x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:54:32] (03CR) 10CDanis: [C:03+1] Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 (owner: 10Alexandros Kosiaris) [16:55:31] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 (owner: 10Alexandros Kosiaris) [16:55:32] (03CR) 10Vgutierrez: [C:03+1] Revert^2 "base::sysctl: Switch priority of the ubuntu-defaults stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1237268 (owner: 10Alexandros Kosiaris) [16:57:55] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:58:17] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host pki1002 - ayounsi@cumin1003" [16:58:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host pki1002 - ayounsi@cumin1003" [16:58:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:22] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache pki1002.eqiad.wmnet 44.32.64.10.in-addr.arpa 4.4.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all 
recursors [16:58:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki1002.eqiad.wmnet 44.32.64.10.in-addr.arpa 4.4.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:58:26] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pki1002 [16:58:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:59:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pki1002 [16:59:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host pki1002 [17:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:01:24] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1237256|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]], [[gerrit:1237255|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]] [17:01:27] T415282: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282 [17:03:14] (03PS1) 10Cathal Mooney: Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) [17:03:16] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1237256|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]], [[gerrit:1237255|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:03:30] FIRING: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [17:06:25] (03CR) 10Kamila Součková: rest gateway: define new limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [17:08:30] RESOLVED: [8x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-sshlb_29418 has 1 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [17:08:39] \o/ [17:10:55] (03CR) 10BCornwall: [C:03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1237137 (https://phabricator.wikimedia.org/T416554) (owner: 10Gerrit maintenance bot) [17:11:18] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:11:25] 
(03CR) 10BCornwall: [C:03+1] wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1237139 (https://phabricator.wikimedia.org/T416555) (owner: 10Gerrit maintenance bot) [17:13:49] (03PS1) 10Majavah: P:toolforge: k8s: Provide list of API server CIDRs for Helm usage [puppet] - 10https://gerrit.wikimedia.org/r/1237273 (https://phabricator.wikimedia.org/T407852) [17:14:28] (03CR) 10CI reject: [V:04-1] P:toolforge: k8s: Provide list of API server CIDRs for Helm usage [puppet] - 10https://gerrit.wikimedia.org/r/1237273 (https://phabricator.wikimedia.org/T407852) (owner: 10Majavah) [17:14:30] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [17:14:54] (03PS2) 10Majavah: P:toolforge: k8s: Provide list of API server CIDRs for Helm usage [puppet] - 10https://gerrit.wikimedia.org/r/1237273 (https://phabricator.wikimedia.org/T407852) [17:15:28] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237256|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]], [[gerrit:1237255|Stop relying on ThumbRenderMap and use a standard size instead (T415282)]] (duration: 14m 04s) [17:15:32] T415282: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282 [17:16:13] (03PS2) 10Cathal Mooney: Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) [17:18:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [17:18:42] (03CR) 10Jgreen: [C:03+1] Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [17:18:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - 
https://phabricator.wikimedia.org/T413634#11588560 (10bd808) >>! In T413634#11580566, @A_smart_kitten wrote: >>>! In T413634#11580561, @sbassett wrote: >> Anything else left to do here?... [17:23:49] (03Abandoned) 10Majavah: P:toolforge: k8s: Provide list of API server CIDRs for Helm usage [puppet] - 10https://gerrit.wikimedia.org/r/1237273 (https://phabricator.wikimedia.org/T407852) (owner: 10Majavah) [17:27:06] (03CR) 10Ayounsi: [C:03+1] Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [17:30:58] (03PS1) 10Alexandros Kosiaris: liberica: Enable it in staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) [17:34:40] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [17:35:05] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1002.eqiad.wmnet with OS bullseye [17:35:25] (03CR) 10Ayounsi: [C:03+2] reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [17:35:34] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11588651 (10KFrancis) Hi all, the NDA is complete. Thanks! 
[17:36:52] (03CR) 10RLazarus: [C:03+2] sophroid: Fork app.generic.container template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235191 (owner: 10RLazarus) [17:37:09] (03PS2) 10Alexandros Kosiaris: liberica: Enable it in staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) [17:39:15] (03Merged) 10jenkins-bot: sophroid: Fork app.generic.container template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235191 (owner: 10RLazarus) [17:39:19] (03CR) 10Vgutierrez: liberica: Enable it in staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [17:39:31] (03CR) 10RLazarus: [C:03+2] sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 (owner: 10RLazarus) [17:39:39] (03CR) 10CI reject: [V:04-1] sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 (owner: 10RLazarus) [17:39:43] (03PS3) 10RLazarus: sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 [17:40:16] (03CR) 10RLazarus: sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 (owner: 10RLazarus) [17:40:22] (03Merged) 10jenkins-bot: reimage: use the freshest IP for DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1237151 (https://phabricator.wikimedia.org/T416401) (owner: 10Ayounsi) [17:40:27] (03CR) 10RLazarus: [C:03+2] sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 (owner: 10RLazarus) [17:40:54] (03PS3) 10Alexandros Kosiaris: k8s-staging: Switch to IPIP mode [puppet] - 10https://gerrit.wikimedia.org/r/1237277 
(https://phabricator.wikimedia.org/T352956) [17:41:24] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cookbook sre.hosts.reimage: DHCP snippet created with old IP when --move-vlan is used - https://phabricator.wikimedia.org/T416401#11588732 (10ayounsi) 05Open→03Resolved a:03ayounsi fixed. [17:42:30] (03Merged) 10jenkins-bot: sophroid: Combine our own volumeMounts with the ones from the template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235192 (owner: 10RLazarus) [17:42:42] (03PS4) 10RLazarus: sophroid: Move our custom arguments into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235193 [17:44:42] (03PS1) 10Ladsgroup: Stop pre-gen jobs altogether [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237279 (https://phabricator.wikimedia.org/T408062) [17:44:58] (03PS4) 10Alexandros Kosiaris: k8s-staging: Switch to IPIP mode [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) [17:44:58] (03PS1) 10Alexandros Kosiaris: k8s-staging: Set ipip_encapsulation in service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/1237280 (https://phabricator.wikimedia.org/T352956) [17:45:08] (03CR) 10RLazarus: [C:03+2] sophroid: Move our custom arguments into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235193 (owner: 10RLazarus) [17:45:22] (03CR) 10Alexandros Kosiaris: k8s-staging: Switch to IPIP mode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [17:45:29] (03PS2) 10Ladsgroup: Stop thumbnail pre-gen jobs altogether [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237279 (https://phabricator.wikimedia.org/T408062) [17:47:15] (03Merged) 10jenkins-bot: sophroid: Move our custom arguments into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235193 (owner: 10RLazarus) [17:50:24] (03CR) 10Vgutierrez: "this can't be merged or the realserver::ipip profile 
would attempt to perform mss clamping in the workers using eBPF. it needs to be patch" [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [17:51:15] 10SRE-swift-storage, 06Data-Persistence, 10MediaSearch, 10Thumbor, and 2 others: MediaSearch should stop relying on render map config - https://phabricator.wikimedia.org/T415282#11588761 (10Ladsgroup) 05Open→03Resolved [17:52:55] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:59:13] (03PS3) 10Cathal Mooney: Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) [18:00:05] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1800) [18:00:13] (03PS1) 10Dzahn: tcpproxy: rename to tcp-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1237283 [18:00:33] (03CR) 10CI reject: [V:04-1] tcpproxy: rename to tcp-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1237283 (owner: 10Dzahn) [18:01:50] (03PS2) 10Dzahn: tcpproxy: rename to tcp-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1237283 [18:01:56] (03CR) 10Cathal Mooney: [C:03+2] Network: data.yaml - rename frack-fundraising vlan [puppet] - 10https://gerrit.wikimedia.org/r/1237270 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [18:02:21] (03CR) 10CI reject: [V:04-1] tcpproxy: rename to tcp-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1237283 (owner: 10Dzahn) [18:17:41] Nothing to ship in my window today [18:26:33] (03CR) 10Daniel Kinzler: rest gateway: include service values.yaml when testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 (owner: 10Daniel Kinzler) [18:27:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11588906 (10RobH) > WMF request for Dell USA - Help determining the part # for R450 power distribution board > > Dell Team, > > I have an odd request, so I'll give you the background first. We have... 
[18:30:37] (03CR) 10Reedy: [C:03+1] Stop thumbnail pre-gen jobs altogether [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237279 (https://phabricator.wikimedia.org/T408062) (owner: 10Ladsgroup) [18:31:35] (03CR) 10Daniel Kinzler: rest gateway: define new limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [18:31:58] (03PS8) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [18:32:07] (03CR) 10Daniel Kinzler: rest gateway: define new limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [18:34:17] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:36:48] (03CR) 10BCornwall: prometheus: add depooled cp* host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:43:33] (03PS1) 10Dzahn: zuul: create shared config dir for zookeeper-zuul mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1237294 (https://phabricator.wikimedia.org/T395938) [18:46:45] (03PS2) 10Dzahn: zuul: create shared config dir for zookeeper-zuul mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1237294 (https://phabricator.wikimedia.org/T395938) [18:49:06] 06SRE, 10SRE-Access-Requests: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde - https://phabricator.wikimedia.org/T416592#11588973 (10KFrancis) Hi all, I have sent the NDA out for signatures. I'll confirm when it's complete. Thanks! 
[18:49:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1237294/7992/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1237294 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:51:23] (03CR) 10Dzahn: [C:03+2] zuul: create shared config dir for zookeeper-zuul mTLS [puppet] - 10https://gerrit.wikimedia.org/r/1237294 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:51:36] (03CR) 10BCornwall: prometheus: add depooled cp* host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:52:08] topranks: your data.yaml change is still pending on puppetserver [18:52:59] looking for help backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1236777 [18:53:12] (don't have rights yet, but I applied, so hopefully this'll be the last time) [18:54:03] milimetric: I can push buttons for you. Have you made the backport patch(es) yet? [18:55:47] no, just merged that [18:56:10] (03PS1) 10Daniel Kinzler: rest-gateway: remove suppotr for insecure user ID cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) [18:57:45] bd808: I'm not clear on what exactly I am allowed to do without the spiderpig rights, and I never deployed anything to MW myself [18:58:08] milimetric: learning curves! nice. [18:59:46] The backport patch part is as easy as using the "Cherry pick" in the kabob menu at the top right to create a patch on the target backport branch. [19:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1900) [19:00:16] Use https://versions.toolforge.org/ to figure out which branches you want to backport to. [19:01:30] milimetric: brennen has the train slot now, and I should eat lunch. 
Prep your cherry-picks in Gerrit and then one of us can help you after the train needful is managed. [19:02:19] re: train, double checking but i don't believe we have anything to do for this slot. [19:03:04] mutante: total brain fart sorry [19:03:45] yep, train's all good. milimetric, i'm available to deploy the backport. [19:03:53] mutante: is it ok to merge your zookeeper change too? [19:04:45] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:06:55] topranks: yes, go ahead please [19:07:54] mutante: ok done, thanks for the heads up <3 [19:08:42] thanks [19:09:00] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change fasw2-c1 to fasw2-e15 to match new location - cmooney@cumin1003" [19:09:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change fasw2-c1 to fasw2-e15 to match new location - cmooney@cumin1003" [19:09:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:09:49] (03PS1) 10Milimetric: Collect data four ways to find discrepancies [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237299 (https://phabricator.wikimedia.org/T416472) [19:10:27] brennen: ^ that's my only patch, thanks so much [19:11:10] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [19:11:33] (03PS1) 10Cathal Mooney: Change name and parent for fasw2-c1x-eqiad switches, moved to rack e15 [puppet] - 10https://gerrit.wikimedia.org/r/1237300 (https://phabricator.wikimedia.org/T403035) [19:11:59] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [19:12:56] (03CR) 10Cathal Mooney: [C:03+2] Change name and parent for fasw2-c1x-eqiad switches, moved to rack e15 [puppet] - 
10https://gerrit.wikimedia.org/r/1237300 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [19:13:12] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:13:19] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:15:17] (03PS1) 10Cathal Mooney: Fundraising move: add new fasw devices in rack e16 to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237301 (https://phabricator.wikimedia.org/T403035) [19:15:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237299 (https://phabricator.wikimedia.org/T416472) (owner: 10Milimetric) [19:15:57] milimetric: going ahead with that. you should get pinged when it's on testservers. [19:16:20] <3 [19:16:57] meanwhile i note quite a few parsoid errors in logs. filing something for that. 
[19:17:22] (03PS2) 10Cathal Mooney: Fundraising move: add new fasw devices in rack e16 to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237301 (https://phabricator.wikimedia.org/T403035) [19:18:19] (03Merged) 10jenkins-bot: Collect data four ways to find discrepancies [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237299 (https://phabricator.wikimedia.org/T416472) (owner: 10Milimetric) [19:18:39] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1237299|Collect data four ways to find discrepancies (T416472)]] [19:18:42] T416472: Send client signals in various ways to understand new data - https://phabricator.wikimedia.org/T416472 [19:18:52] (03CR) 10Cathal Mooney: [C:03+2] Fundraising move: add new fasw devices in rack e16 to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1237301 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [19:20:35] !log brennen@deploy2002 milimetric, brennen: Backport for [[gerrit:1237299|Collect data four ways to find discrepancies (T416472)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:20:39] (03CR) 10Kamila Součková: [C:03+1] rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [19:20:57] (testing, thx) [19:21:08] cool, awaiting ping [19:22:13] PROBLEM - Host fasw2-e15a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:22:13] PROBLEM - Host fasw2-e15b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:23:17] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:23:21] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:24:25] brennen: looks perfect, and seems to have not broken anything :) [19:24:32] cool, syncing. 
:) [19:24:37] !log brennen@deploy2002 milimetric, brennen: Continuing with sync [19:24:44] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on fasw2-e15a-eqiad,fasw2-e15b-eqiad with reason: fundraising migration eqiad [19:24:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11589122 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=785b501b-5e53-43b0-b903-5d93372eb8e1) set by cmooney@cumin1003 for 1 day, 0:00:00 on 2 host(s... [19:25:13] brennen: wait [19:25:30] sorry just realized something I think we all missed [19:26:10] yeah, wow, this needs a patch or it'll start sending on all traffic everywhere [19:28:43] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237299|Collect data four ways to find discrepancies (T416472)]] (duration: 10m 03s) [19:28:46] T416472: Send client signals in various ways to understand new data - https://phabricator.wikimedia.org/T416472 [19:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:29:38] (03PS1) 10Bking: opensearch-ipoid: add cert SANs for non-discovery endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237303 (https://phabricator.wikimedia.org/T416345) [19:31:55] brennen: patch ready as https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1237304 [19:32:13] (terribly sorry, four of us looked at this and we all missed it) [19:33:05] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [19:34:11] (03PS1) 10Jdrewniak: Enable 
Extension:WP25EasterEggs on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237306 [19:34:26] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rename fmsw1-c1-eqiad to fmsw1-e15-eqiad - cmooney@cumin1003 - T403035" [19:34:29] T403035: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035 [19:34:31] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rename fmsw1-c1-eqiad to fmsw1-e15-eqiad - cmooney@cumin1003 - T403035" [19:35:08] (03Merged) 10jenkins-bot: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 (owner: 10Daniel Kinzler) [19:35:56] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:36:00] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:36:51] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rename fmsw1-c1-eqiad to fmsw1-e15-eqiad - cmooney@cumin1003 - T403035" [19:36:56] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rename fmsw1-c1-eqiad to fmsw1-e15-eqiad - cmooney@cumin1003 - T403035" [19:38:26] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:38:30] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:39:46] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:39:54] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:40:55] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [19:41:32] (03PS1) 10Milimetric: Fix instrument to not send when not in sample 
[extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237309 [19:41:42] (03CR) 10Milimetric: [C:03+2] Fix instrument to not send when not in sample [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237309 (owner: 10Milimetric) [19:42:12] brennen: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1237309 is the cherry-picked backport [19:43:44] (until that's deployed, 100% of traffic is sending 2 events on every page) [19:44:26] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [19:45:13] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [19:45:33] (03PS1) 10BCornwall: ncredir: Ignore wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1237310 (https://phabricator.wikimedia.org/T416629) [19:45:52] (03Merged) 10jenkins-bot: Fix instrument to not send when not in sample [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237309 (owner: 10Milimetric) [19:50:45] jouncebot: nowandnext [19:50:45] For the next 1 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1900) [19:50:45] In 1 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T2100) [19:51:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237306 (owner: 10Jdrewniak) [19:52:02] milimetric, brennen: I can deploy the change. 
AIUI the train ran this morning so we're clear [19:53:06] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1237309|Fix instrument to not send when not in sample]] [19:54:55] !log phuedx@deploy2002 phuedx, milimetric: Backport for [[gerrit:1237309|Fix instrument to not send when not in sample]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:55:08] milimetric: ^ [19:55:27] love [19:55:28] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [19:55:53] testing [19:56:40] phuedx: it stopped sending to enwiki [19:57:59] milimetric: Continue? [19:58:12] phuedx: as far as I can tell this fixes it and nothing else looks broken, go for it [19:58:19] and so many pints and loves thank you sorry [19:58:22] !log phuedx@deploy2002 phuedx, milimetric: Continuing with sync [20:02:26] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237309|Fix instrument to not send when not in sample]] (duration: 09m 20s) [20:02:50] RECOVERY - Host fasw2-e15a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [20:03:32] milimetric: The deployment is finished. Please keep an eye on the event rate [20:03:36] RECOVERY - Host fasw2-e15b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [20:06:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: DHCP failing for at least 2 ms-be servers in codfw - https://phabricator.wikimedia.org/T415189#11589253 (10ayounsi) I think that was due to the bug fixed in {T416401}. It should be good now. 
[20:08:38] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [20:09:31] jouncebot: nowandnext [20:09:31] For the next 0 hour(s) and 50 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T1900) [20:09:31] In 0 hour(s) and 50 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T2100) [20:11:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237279 (https://phabricator.wikimedia.org/T408062) (owner: 10Ladsgroup) [20:12:28] (03Merged) 10jenkins-bot: Stop thumbnail pre-gen jobs altogether [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237279 (https://phabricator.wikimedia.org/T408062) (owner: 10Ladsgroup) [20:12:48] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1237279|Stop thumbnail pre-gen jobs altogether (T408062)]] [20:12:52] T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062 [20:14:43] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1237279|Stop thumbnail pre-gen jobs altogether (T408062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:15:14] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [20:15:18] (03PS1) 10Daniel Kinzler: rest-gateway: fix re-serialization of large numbers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237318 [20:16:25] going afk now but I'm around if anything crazy happens with these events again. 
In case nobody can reach me feel free to just revert any of my changes (goes without saying, always) [20:16:48] 10ops-eqiad, 06DC-Ops: eno1 on wikikube-worker1062:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T416635 (10phaultfinder) 03NEW [20:18:27] (03CR) 10Pppery: [C:03+1] ncredir: Ignore wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1237310 (https://phabricator.wikimedia.org/T416629) (owner: 10BCornwall) [20:19:17] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237279|Stop thumbnail pre-gen jobs altogether (T408062)]] (duration: 06m 29s) [20:19:20] T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062 [20:21:40] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: fix re-serialization of large numbers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237318 (owner: 10Daniel Kinzler) [20:24:19] (03Merged) 10jenkins-bot: rest-gateway: fix re-serialization of large numbers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237318 (owner: 10Daniel Kinzler) [20:26:18] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [20:27:13] (03CR) 10Dzahn: [C:03+1] ncredir: Ignore wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1237310 (https://phabricator.wikimedia.org/T416629) (owner: 10BCornwall) [20:28:42] (03CR) 10Dzahn: [C:03+1] "as long as the donate link keeps working" [puppet] - 10https://gerrit.wikimedia.org/r/1237310 (https://phabricator.wikimedia.org/T416629) (owner: 10BCornwall) [20:28:46] (03PS1) 10Daniel Kinzler: rest-gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237326 [20:28:51] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [20:28:55] (03CR) 10CI reject: [V:04-1] rest-gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237326 (owner: 10Daniel Kinzler) [20:29:57] (03PS2) 10Daniel Kinzler: 
rest-gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237326 [20:30:40] (03PS1) 10Dzahn: zuul: parameterize and configure new config dir for zookeeper-zuul [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) [20:31:10] (03CR) 10CI reject: [V:04-1] zuul: parameterize and configure new config dir for zookeeper-zuul [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:31:59] brennen, phuedx: thanks for helping milimetric [20:32:17] (03CR) 10Kamila Součková: [C:03+2] rest-gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237326 (owner: 10Daniel Kinzler) [20:32:33] bd808: <3 [20:32:57] (03CR) 10Bking: [C:03+1] "nit: I would add T416365 to the Bug: line and grab/close that ticket too" [alerts] - 10https://gerrit.wikimedia.org/r/1236852 (https://phabricator.wikimedia.org/T414306) (owner: 10Ryan Kemper) [20:34:21] (03Merged) 10jenkins-bot: rest-gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237326 (owner: 10Daniel Kinzler) [20:35:03] (03PS2) 10Dzahn: zuul: parameterize and configure new config dir for zookeeper-zuul [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) [20:35:29] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [20:35:33] (03CR) 10CI reject: [V:04-1] zuul: parameterize and configure new config dir for zookeeper-zuul [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:35:59] gah, my sincerest apologies for missing patch followup. too much multitask this afternoon. [20:36:25] major lapse of responsibility on my part. 
[20:36:39] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [20:36:50] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [20:39:46] (03PS1) 10DLynch: EditCheck: Adjust copy of experimental checks [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237329 [20:41:10] (03PS2) 10Ryan Kemper: wdqs: detune BlazegraphFailedServerRatioIncrease [alerts] - 10https://gerrit.wikimedia.org/r/1236852 (https://phabricator.wikimedia.org/T416365) [20:42:10] FIRING: BFDdown: BFD session down between cr4-ulsfo and 198.35.26.203 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:43:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237329 (owner: 10DLynch) [20:44:16] (03PS3) 10Dzahn: zuul: parameterize and configure new config dir for zookeeper-zuul [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) [20:45:07] (03PS1) 10DLynch: TextMatchEditCheck: Place 'dismiss' action last [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237331 [20:45:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237331 (owner: 10DLynch) [20:45:41] (03PS1) 10DLynch: TextMatch: allow links in descriptions [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237333 
(https://phabricator.wikimedia.org/T416511) [20:45:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237333 (https://phabricator.wikimedia.org/T416511) (owner: 10DLynch) [20:46:39] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1237327/7993/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1237327 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:47:10] RESOLVED: BFDdown: BFD session down between cr4-ulsfo and 198.35.26.203 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:49:30] (03CR) 10Ryan Kemper: [C:03+2] wdqs: detune BlazegraphFailedServerRatioIncrease [alerts] - 10https://gerrit.wikimedia.org/r/1236852 (https://phabricator.wikimedia.org/T416365) (owner: 10Ryan Kemper) [20:50:40] (03Merged) 10jenkins-bot: wdqs: detune BlazegraphFailedServerRatioIncrease [alerts] - 10https://gerrit.wikimedia.org/r/1236852 (https://phabricator.wikimedia.org/T416365) (owner: 10Ryan Kemper) [20:51:09] (03PS3) 10Ryan Kemper: feat(WDQS)!: disable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:51:09] (03PS2) 10Ryan Kemper: cleanup(WDQS/traffic): cleanup backend.yaml rules for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237145 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:51:09] (03PS2) 10Ryan Kemper: cleanup(WDQS): remove monitoring for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237146 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:51:09] (03PS2) 10Ryan 
Kemper: cleanup(WDQS): remove WDQS LDF endpoint from cfssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/1237147 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:51:10] (03PS2) 10Ryan Kemper: cleanup(WDQS): remove all remaining references to the WDQS LDF endpoint. [puppet] - 10https://gerrit.wikimedia.org/r/1237148 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:51:51] (03CR) 10BCornwall: [C:03+2] "yep, no worries there." [puppet] - 10https://gerrit.wikimedia.org/r/1237310 (https://phabricator.wikimedia.org/T416629) (owner: 10BCornwall) [20:53:28] (03CR) 10Ryan Kemper: [C:03+1] feat(WDQS)!: disable LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237142 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:54:31] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1236859 (owner: 10Ncmonitor) [20:54:33] (03PS1) 10Daniel Kinzler: rest-gateway: make staging override ratelimit policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237335 [20:54:58] (03CR) 10Ryan Kemper: [C:03+1] cleanup(WDQS/traffic): cleanup backend.yaml rules for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237145 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:55:06] (03CR) 10Ryan Kemper: [C:03+1] cleanup(WDQS): remove monitoring for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237146 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:55:11] (03CR) 10Ryan Kemper: [C:03+1] cleanup(WDQS): remove WDQS LDF endpoint from cfssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/1237147 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:55:16] (03CR) 10Ryan Kemper: [C:03+1] cleanup(WDQS): remove all remaining references to the WDQS LDF endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/1237148 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [20:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T2100). [21:00:05] sfaci, jan_drewniak, and Kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:26] o/ I have three patches that can all be merged together. [21:01:32] o/ [21:01:33] (And can get it myself.) [21:02:15] I'll do sfaci's and mine [21:02:36] jan_drewniak: Thanks! [21:04:38] It's erroring because one of your changes is flagged WIP. [21:05:02] Or has a dependency that is, at least. [21:05:31] (03PS1) 10Kamila Součková: Revert "rest-gateway: bump chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237337 [21:06:32] Looks specifically like 1237171's depends-on is pointing to an abandoned patch. [21:07:05] (Which is the one that you're also trying to merge a cherry-pick of.) 
[21:07:47] (03Abandoned) 10Kamila Součková: Revert "rest-gateway: bump chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237337 (owner: 10Kamila Součková) [21:07:59] I guess we can remove that dependency, it's not a hard dependency actually [21:08:06] sfaci: yeah, I think we can remove the Depends-on: on that patch [21:08:13] I'll do it [21:08:26] (03PS2) 10Santiago Faci: Renaming `MetricsPlatform` => `TestKitchen` [extensions/ReadingLists] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237171 (https://phabricator.wikimedia.org/T414435) [21:08:51] I shouldn't have added it. In reality both patches can be merged/deployed independently [21:09:24] it's done. Pipeline is running already [21:09:27] (03PS1) 10Kamila Součková: Revert "rest-gateway: fix re-serialization of large numbers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237338 [21:09:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/ReadingLists] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237171 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [21:09:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237170 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [21:10:23] (03Abandoned) 10Kamila Součková: Revert "rest-gateway: fix re-serialization of large numbers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237338 (owner: 10Kamila Součková) [21:11:50] (03PS1) 10Kamila Součková: Revert "rest gateway: define new limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237340 [21:11:59] (03CR) 10CI reject: [V:04-1] Revert "rest gateway: define new limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237340 (owner: 10Kamila Součková) [21:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on 
ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:31] (03PS1) 10Dzahn: zuul/zookeeper: use CA-only, not chained file, as truststore [puppet] - 10https://gerrit.wikimedia.org/r/1237341 (https://phabricator.wikimedia.org/T395938) [21:12:33] (03Merged) 10jenkins-bot: Renaming `MetricsPlatform` => `TestKitchen` [extensions/ReadingLists] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237171 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [21:12:47] (03Merged) 10jenkins-bot: readingListAB.js: Updated to use mw.testKitchen [extensions/WikimediaEvents] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237170 (https://phabricator.wikimedia.org/T414435) (owner: 10Santiago Faci) [21:13:05] (03PS2) 10Dzahn: zuul/zookeeper: use CA-only, not chained file, as truststore [puppet] - 10https://gerrit.wikimedia.org/r/1237341 (https://phabricator.wikimedia.org/T395938) [21:13:09] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1237171|Renaming `MetricsPlatform` => `TestKitchen` (T414435)]], [[gerrit:1237170|readingListAB.js: Updated to use mw.testKitchen (T414435)]] [21:13:12] T414435: [Renaming TestKitchen] Update ReadingList extension - https://phabricator.wikimedia.org/T414435 [21:14:49] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1237341/7994/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1237341 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:15:07] !log jdrewniak@deploy2002 sfaci, jdrewniak: Backport for [[gerrit:1237171|Renaming `MetricsPlatform` => `TestKitchen` (T414435)]], [[gerrit:1237170|readingListAB.js: Updated to use mw.testKitchen (T414435)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:16:53] sfaci: testing on debug servers... [21:16:56] change looks good to me on test wikipedia, with experiment override and i see events [21:17:14] aude: thank you! [21:17:20] !log jdrewniak@deploy2002 sfaci, jdrewniak: Continuing with sync [21:17:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [21:17:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [21:17:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:18:49] (03PS2) 10Kamila Součková: Revert "rest gateway: define new limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237340 [21:21:24] (03CR) 10Kamila Součková: [C:03+2] Revert "rest gateway: define new limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237340 (owner: 10Kamila Součková) [21:21:25] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237171|Renaming `MetricsPlatform` => `TestKitchen` (T414435)]], [[gerrit:1237170|readingListAB.js: Updated to use mw.testKitchen (T414435)]] (duration: 08m 16s) [21:21:28] T414435: [Renaming TestKitchen] Update ReadingList extension - https://phabricator.wikimedia.org/T414435 [21:22:06] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [21:22:11] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [21:22:27] ok doing mine now [21:22:30] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [21:22:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237306 (owner: 10Jdrewniak) [21:23:18] (03Merged) 
10jenkins-bot: Revert "rest gateway: define new limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237340 (owner: 10Kamila Součková) [21:23:24] baby globe is coming! [21:23:27] (03Merged) 10jenkins-bot: Enable Extension:WP25EasterEggs on testwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237306 (owner: 10Jdrewniak) [21:23:28] to test wikipedia [21:23:41] \o/ [21:23:44] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1237306|Enable Extension:WP25EasterEggs on testwiki.]] [21:23:51] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [21:24:05] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [21:25:39] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1237306|Enable Extension:WP25EasterEggs on testwiki.]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:26:34] I see https://test.wikipedia.org/wiki/Special:CommunityConfiguration/WP25EasterEggs but I am not an admin [21:27:22] Looks like I am :D ok syncing. [21:27:26] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [21:31:29] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237306|Enable Extension:WP25EasterEggs on testwiki.]] (duration: 07m 45s) [21:31:49] ok done, Kemayo passing it to you [21:31:56] jan_drewniak: thanks! 
[21:32:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237329 (owner: 10DLynch) [21:32:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237331 (owner: 10DLynch) [21:32:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237333 (https://phabricator.wikimedia.org/T416511) (owner: 10DLynch) [21:43:50] (03Merged) 10jenkins-bot: EditCheck: Adjust copy of experimental checks [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237329 (owner: 10DLynch) [21:46:35] (03Merged) 10jenkins-bot: TextMatchEditCheck: Place 'dismiss' action last [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237331 (owner: 10DLynch) [21:46:36] (03Merged) 10jenkins-bot: TextMatch: allow links in descriptions [extensions/VisualEditor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237333 (https://phabricator.wikimedia.org/T416511) (owner: 10DLynch) [21:46:57] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1237329|EditCheck: Adjust copy of experimental checks]], [[gerrit:1237331|TextMatchEditCheck: Place 'dismiss' action last]], [[gerrit:1237333|TextMatch: allow links in descriptions (T416511)]] [21:47:00] T416511: TextMatchEditCheck: Add support for links in matchItem descriptions - https://phabricator.wikimedia.org/T416511 [21:47:21] (03PS1) 10Dzahn: zookeeper/zuul: use ssl.trustStore.password instead ssl.trustStore.passwordPath [puppet] - 10https://gerrit.wikimedia.org/r/1237342 (https://phabricator.wikimedia.org/T395938) [21:47:48] (03CR) 10CI reject: [V:04-1] zookeeper/zuul: use ssl.trustStore.password instead ssl.trustStore.passwordPath 
[puppet] - 10https://gerrit.wikimedia.org/r/1237342 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:48:51] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1237329|EditCheck: Adjust copy of experimental checks]], [[gerrit:1237331|TextMatchEditCheck: Place 'dismiss' action last]], [[gerrit:1237333|TextMatch: allow links in descriptions (T416511)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:49:16] (03PS2) 10Dzahn: zookeeper/zuul: use ssl.trustStore.password instead ssl.trustStore.passwordPath [puppet] - 10https://gerrit.wikimedia.org/r/1237342 (https://phabricator.wikimedia.org/T395938) [21:49:22] (03PS1) 10Dzahn: zookeeper: add fake TLS password to match private repo [labs/private] - 10https://gerrit.wikimedia.org/r/1237343 (https://phabricator.wikimedia.org/T395938) [21:49:51] (03PS2) 10Dzahn: zookeeper: add fake TLS password to match private repo [labs/private] - 10https://gerrit.wikimedia.org/r/1237343 (https://phabricator.wikimedia.org/T395938) [21:50:47] (03PS3) 10Dzahn: zookeeper: add fake TLS password to match private repo [labs/private] - 10https://gerrit.wikimedia.org/r/1237343 (https://phabricator.wikimedia.org/T395938) [21:51:05] (03CR) 10Dzahn: [V:03+2 C:03+2] zookeeper: add fake TLS password to match private repo [labs/private] - 10https://gerrit.wikimedia.org/r/1237343 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:51:06] !log kemayo@deploy2002 kemayo: Continuing with sync [21:55:16] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237329|EditCheck: Adjust copy of experimental checks]], [[gerrit:1237331|TextMatchEditCheck: Place 'dismiss' action last]], [[gerrit:1237333|TextMatch: allow links in descriptions (T416511)]] (duration: 08m 19s) [21:55:20] T416511: TextMatchEditCheck: Add support for links in matchItem descriptions - https://phabricator.wikimedia.org/T416511 [21:57:39] (03PS1) 10Dzahn: zuul: fix renamed 
password variable [labs/private] - 10https://gerrit.wikimedia.org/r/1237345 (https://phabricator.wikimedia.org/T395938) [22:00:02] (03PS2) 10Dzahn: zuul: fix renamed password variable [labs/private] - 10https://gerrit.wikimedia.org/r/1237345 (https://phabricator.wikimedia.org/T395938) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260205T2200) [22:00:21] preparing to do a security deploy [22:00:58] (03Abandoned) 10Dzahn: zuul/zookeeper: debug (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1236845 (owner: 10Dzahn) [22:01:27] (03CR) 10Dzahn: [V:03+2 C:03+2] zuul: fix renamed password variable [labs/private] - 10https://gerrit.wikimedia.org/r/1237345 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:03:09] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1237342/7996/zuul1001.eqiad.wmnet/change.zuul1001.eqiad.wmnet.pson.gz" [puppet] - 10https://gerrit.wikimedia.org/r/1237342 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:10:56] PROBLEM - Host an-worker1187 is DOWN: PING CRITICAL - Packet loss = 100% [22:13:15] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1237342/7997/" [puppet] - 10https://gerrit.wikimedia.org/r/1237342 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:14:04] preparing to run scap [22:15:58] RECOVERY - Host an-worker1187 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [22:24:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:27:22] PROBLEM - Host an-worker1187 is DOWN: PING CRITICAL - Packet loss = 100% [22:29:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors 
- kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:29:22] RECOVERY - SSH on an-worker1187 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:29:24] RECOVERY - Host an-worker1187 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:29:28] canary checks are failing [22:30:07] maryum: yes, that's probably the same underlying cause as the MediaWikiHighErrorRate alert above [22:30:36] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1187 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [22:30:39] I think so [22:32:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [22:32:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [22:32:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:33:14] FIRING: KubernetesDeploymentUnavailableReplicas: ... [22:33:14] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ...
[22:33:14] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:34:17] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:37:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:38:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:40:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:40:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:47:59] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [22:47:59] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [22:48:02] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:51:21] reverting my change since that caused a spike in errors [22:54:19] thanks maryum! [22:55:06] swfrench-wmf: so this is all canary error traffic? maryum’s core security patch is not on .14 on deployment fwict... [22:55:46] and i don’t know why this would have happened since she was deploying a patch to an unrelated extension that doesn’t touch RevisionRecord at all... 
[22:55:55] so, it definitely made it into the image that was deployed to canary [22:56:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11589899 (10RKemper) Alright, entered the emergency shell, added a virtual device (`/dev/sdm`) for the new drive located at `252:8`, formatted/partitioned/... [22:56:40] (03PS1) 10Dzahn: zuul-web: bind mount /etc/zookeeper/zuul-tls [puppet] - 10https://gerrit.wikimedia.org/r/1237354 [22:56:48] I don't know enough about how the security deployment process works to understand how the core patch may have been picked up as well, but it certainly seems to be there [22:56:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11589902 (10RKemper) 05Open→03Resolved [22:57:57] errors have stopped as of 22:53. MediaWikiHighErrorRate should resolve shortly. [22:58:35] (03PS4) 10Bking: DO NOT MERGE: opensearch-ipoid: add cert SANs for non-discovery endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237303 (https://phabricator.wikimedia.org/T416345) [22:59:05] ok, maryum did a sync world. and the bad patch was applied. so we need to get that patch removed and then re-sync. 
[22:59:13] (03Abandoned) 10Bking: DO NOT MERGE: opensearch-ipoid: add cert SANs for non-discovery endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237303 (https://phabricator.wikimedia.org/T416345) (owner: 10Bking) [22:59:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:59:17] !log Deployed security fix for T416502 [22:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:32] MediaWikiHighErrorRate resolved \o/ [23:00:10] here and paying attention if you need anything from oncall btw, I'm just not jumping in since it seems like things are going okay :) [23:00:24] thanks, r.zl! [23:02:08] i think we’ve got it figured out now. the bad patch has been removed and the errors have resolved. but we need to fix the bad patch and then try a re-deploy at some point. [23:02:14] thanks all for the support... 
[23:02:30] (03PS19) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [23:04:18] going to fix the core patch and try the deploy again, hopefully error free [23:14:30] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7998/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [23:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:20:58] deploying no errors [23:21:26] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11589974 (10Tacsipacsi) #mediaviewer got broken by this. ☹ For example, https://commons.wikimedia.org/wiki/Category:F%C5%91_Street_15... [23:28:21] !log Deployed security fix for T410429 [23:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:30] security deploy finished [23:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:29:21] all looks good from here, thanks maryum [23:33:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[23:33:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [23:33:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:38:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [23:38:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [23:38:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:48:58] (03CR) 10Aaron Schulz: "Looks like I forgot about this. It would be nice to clean this up, especially in case the db lists get outdated." [puppet] - 10https://gerrit.wikimedia.org/r/1210631 (owner: 10Aaron Schulz) [23:50:40] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [23:50:41] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [23:50:50] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster