[00:07:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193578 [00:07:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193578 (owner: 10TrainBranchBot) [00:26:50] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1193578 (owner: 10TrainBranchBot) [00:36:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [00:47:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:56:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:11:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:15:11] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 31s) [01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:41:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [02:24:53] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:12:11] (03PS3) 10Arnaudb: gerrit: add a local backup cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) [05:16:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 2 (gerrit1003, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:29:10] (03CR) 10Arnaudb: "runs with test-cookbook are going well" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [05:34:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:53] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:38] (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/1193595 (https://phabricator.wikimedia.org/T405804) [05:41:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:41:42] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/1193595 (https://phabricator.wikimedia.org/T405804) (owner: 10Marostegui) [05:41:46] !log marostegui@dns1006 START - running authdns-update [05:42:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11244695 (10Marostegui) This proxy was active in m1, I've promoted dbproxy1022 instead so we can deal with this with no risks. 
[05:43:10] !log marostegui@dns1006 END - running authdns-update [05:57:49] (03PS1) 10Giuseppe Lavagetto: Upgrade version with a few bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193598 [05:58:09] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Upgrade version with a few bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193598 (owner: 10Giuseppe Lavagetto) [06:11:29] (03PS2) 10Arnaudb: gerrit: mod_qos tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) [06:11:57] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Upgrade with minor comsmetic tweaks - oblivian@cumin1003" [06:11:58] (03PS2) 10Arnaudb: gerrit: remove localbackup logic from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1193599 (https://phabricator.wikimedia.org/T387833) [06:11:59] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Upgrade with minor comsmetic tweaks - oblivian@cumin1003 [06:12:51] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Upgrade with minor comsmetic tweaks - oblivian@cumin1003 [06:12:52] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Upgrade with minor comsmetic tweaks - oblivian@cumin1003" [06:16:50] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:17:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193520 (https://phabricator.wikimedia.org/T401466) (owner: 10Kosta Harlan) [06:19:04] (03Merged) 10jenkins-bot: UserInfoCard: Hide reverted edit count 
if user has more than 1,000 edits [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193520 (https://phabricator.wikimedia.org/T401466) (owner: 10Kosta Harlan) [06:19:42] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1193520|UserInfoCard: Hide reverted edit count if user has more than 1,000 edits (T401466)]] [06:19:45] T401466: Incorrect number of reverted edits in UserInfoCard - https://phabricator.wikimedia.org/T401466 [06:24:53] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:25:06] <_joe_> kostajh: It would generally really help if people respected the deployment calendar [06:25:51] _joe_: can you clarify, please? AIUI, it is fine to self-serve deploy outside of the scheduled windows [06:27:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:27:44] <_joe_> kostajh: that's not my understanding, and if that's the consensus, then we need to reconsider it. For multiple reasons, including the fact that it might interfere with Other work. For instance, this hour before backports used to (somehow it disappeared) reserved for SRE to deploy infra changes, given now infra and code changes go via the same [06:27:44] <_joe_> process [06:28:20] _joe_: yeah, I looked at the calendar, and saw nothing before the window that starts in 30 minutes. 
[06:29:03] <_joe_> yeah that's strange, we had a UTC early mw infra window; but even besides it, I've always done my changes during backport windows even if I could obviously self-serve [06:29:20] <_joe_> outside of emergencies, ofc [06:29:43] <_joe_> I think having scheduled deployment windows helps with coordination, but we can ask a clarification to thcipriani [06:34:40] _joe_: I sent you some links to internal discussions about this. The main reason I started early is because the backport window already has several other patches scheduled, and it speeds things along to start some of these earlier. also, if you show up as the only person who can deploy, you can lose an hour of your day handling deployments [06:35:45] <_joe_> Well I wouldn't qualify deploying patches for everyone as "losing an hour of my day", but that's kind of besides the point. [06:36:33] Depends on what you need to do in your day :) [06:38:20] (03PS1) 10Muehlenhoff: Record LDAP access for jmonton [puppet] - 10https://gerrit.wikimedia.org/r/1193643 [06:40:06] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for jmonton [puppet] - 10https://gerrit.wikimedia.org/r/1193643 (owner: 10Muehlenhoff) [06:46:18] (03PS1) 10Muehlenhoff: Record LDAP access for a-pizzata [puppet] - 10https://gerrit.wikimedia.org/r/1193699 [06:47:22] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1193520|UserInfoCard: Hide reverted edit count if user has more than 1,000 edits (T401466)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[06:47:25] T401466: Incorrect number of reverted edits in UserInfoCard - https://phabricator.wikimedia.org/T401466 [06:48:34] (03PS2) 10Muehlenhoff: Record LDAP access for a-pizzata [puppet] - 10https://gerrit.wikimedia.org/r/1193699 [06:49:20] !log kharlan@deploy2002 kharlan: Continuing with sync [06:51:18] (03PS1) 10Kosta Harlan: UserInfoCard: Hide new articles count when likely to be inaccurate [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193700 (https://phabricator.wikimedia.org/T399096) [06:51:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193700 (https://phabricator.wikimedia.org/T399096) (owner: 10Kosta Harlan) [06:57:05] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for a-pizzata [puppet] - 10https://gerrit.wikimedia.org/r/1193699 (owner: 10Muehlenhoff) [06:59:58] (03CR) 10Jelto: [C:04-1] "thanks for preparing and testing this change. From the GitLab side this looks good to me. @abran@wikimedia.org does this also makes sense " [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [07:00:04] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T0700). [07:00:05] kostajh, Cappybaraa, hamishcz, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:13] hi [07:00:33] !log rebalance Ganeti eqiad/A following vmscape reboots [07:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:40] o/ [07:00:48] I'm finishing up a deployment right now [07:02:18] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193520|UserInfoCard: Hide reverted edit count if user has more than 1,000 edits (T401466)]] (duration: 42m 35s) [07:02:21] T401466: Incorrect number of reverted edits in UserInfoCard - https://phabricator.wikimedia.org/T401466 [07:03:21] and...hi im here [07:04:16] dcausse: do you want to sync your change? can you also do the one for Hamishcz_ ? I had a look at it, and it seems fine [07:04:41] kostajh: sure I can take of the deploys of this window [07:04:45] *care [07:05:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193423 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [07:06:01] (03PS1) 10Muehlenhoff: url_downloader: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1193701 (https://phabricator.wikimedia.org/T405631) [07:06:53] Hamishcz_: o/ going to ship your change, seems like Cappybaraa is not around yet. [07:07:18] kostajh: double checking, you're done with your deploy? [07:07:38] dcausse: I still have some patches in this window. Shall I finish them, and then get out of your way? 
[07:08:26] kostajh: I can wait no problem [07:08:31] dcausse: thanks [07:09:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193188 (https://phabricator.wikimedia.org/T404622) (owner: 10Kosta Harlan) [07:09:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193700 (https://phabricator.wikimedia.org/T399096) (owner: 10Kosta Harlan) [07:09:36] (03PS1) 10Seanleong-wmde: Pilot wiki for Visual Changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) [07:10:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193701 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [07:17:08] dcausse: any progress? or i missed something? [07:17:48] Hamishcz_: oh sorry, I pinged you a bit too early, Kosta is still deploying some patches [07:19:03] ah ok its fine, and this is patch is okay to directly sync to the world, IMO [07:19:25] ok [07:19:36] just an FYI if i don't reply your msg later [07:19:48] thank you :) [07:19:51] np! 
[07:19:57] (03PS1) 10Fabfur: Add parsoid ua to ua_wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1193708 [07:20:11] (03Merged) 10jenkins-bot: Implement AuthPreserveQueryParams for Metrics Platform mpo param [extensions/MetricsPlatform] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193188 (https://phabricator.wikimedia.org/T404622) (owner: 10Kosta Harlan) [07:20:12] (03Merged) 10jenkins-bot: UserInfoCard: Hide new articles count when likely to be inaccurate [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193700 (https://phabricator.wikimedia.org/T399096) (owner: 10Kosta Harlan) [07:20:41] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1193188|Implement AuthPreserveQueryParams for Metrics Platform mpo param (T404622)]], [[gerrit:1193700|UserInfoCard: Hide new articles count when likely to be inaccurate (T399096)]] [07:20:46] T404622: Preserve mpo query parameter in auth flows - https://phabricator.wikimedia.org/T404622 [07:20:47] T399096: UserInfoCard: Number of "New articles" incorrect - https://phabricator.wikimedia.org/T399096 [07:21:07] (03CR) 10Giuseppe Lavagetto: [C:04-1] "Please change the ACL name." [puppet] - 10https://gerrit.wikimedia.org/r/1193708 (owner: 10Fabfur) [07:21:49] (03CR) 10Kavaljeet Singh: "Scheduled Deployment is done please check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [07:26:53] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1193188|Implement AuthPreserveQueryParams for Metrics Platform mpo param (T404622)]], [[gerrit:1193700|UserInfoCard: Hide new articles count when likely to be inaccurate (T399096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:26:57] T404622: Preserve mpo query parameter in auth flows - https://phabricator.wikimedia.org/T404622 [07:26:58] T399096: UserInfoCard: Number of "New articles" incorrect - https://phabricator.wikimedia.org/T399096 [07:28:09] (03CR) 10Fabfur: Add parsoid ua to ua_wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193708 (owner: 10Fabfur) [07:28:31] (03PS2) 10Fabfur: Add parsoid ua to ua_wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1193708 (https://phabricator.wikimedia.org/T400119) [07:30:18] !log kharlan@deploy2002 kharlan: Continuing with sync [07:31:54] (03CR) 10Fabfur: [C:03+2] Add parsoid ua to ua_wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1193708 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [07:32:04] !log rebalance Ganeti codfw/A following vmscape reboots [07:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:45] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193188|Implement AuthPreserveQueryParams for Metrics Platform mpo param (T404622)]], [[gerrit:1193700|UserInfoCard: Hide new articles count when likely to be inaccurate (T399096)]] (duration: 14m 04s) [07:34:50] T404622: Preserve mpo query parameter in auth flows - https://phabricator.wikimedia.org/T404622 [07:34:50] T399096: UserInfoCard: Number of "New articles" incorrect - https://phabricator.wikimedia.org/T399096 [07:35:12] last one... 
[07:36:11] (03PS3) 10Kosta Harlan: MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) [07:36:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [07:37:15] (03Merged) 10jenkins-bot: MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [07:37:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1193096|MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test (T405239)]] [07:37:38] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [07:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:43:54] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1193096|MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:43:57] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [07:44:53] !log kharlan@deploy2002 kharlan: Continuing with sync [07:45:14] dcausse: https://spiderpig.wikimedia.org/jobs/695 is syncing, once that's done, I'm finished as well. thanks for your patience! [07:45:28] kostajh: np! 
[07:49:16] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193096|MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test (T405239)]] (duration: 11m 42s) [07:49:20] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [07:51:07] ok starting to ship Hamishcz_ patch and mine as well [07:52:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193502 (https://phabricator.wikimedia.org/T406220) (owner: 10Hamish) [07:52:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:53:02] (03Merged) 10jenkins-bot: Allow AbuseFilter to block on ganwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193502 (https://phabricator.wikimedia.org/T406220) (owner: 10Hamish) [07:53:04] (03Merged) 10jenkins-bot: cirrus: test completion with default sort on simplewiki [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:53:23] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1193502|Allow AbuseFilter to block on ganwiki (T406220)]], [[gerrit:1193091|cirrus: test completion with default sort on simplewiki [1/3] (T404858)]] [07:53:27] T406220: Allow block by abuse filter on Gan Wikipedia - https://phabricator.wikimedia.org/T406220 [07:53:28] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:57:08] jouncebot: next [07:57:09] In 2 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1000) [07:58:26] I might need to extend the current window by ~10mins to finish the current deploy [08:00:14] !log 
dcausse@deploy2002 hamishz, dcausse: Backport for [[gerrit:1193502|Allow AbuseFilter to block on ganwiki (T406220)]], [[gerrit:1193091|cirrus: test completion with default sort on simplewiki [1/3] (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:00:19] T406220: Allow block by abuse filter on Gan Wikipedia - https://phabricator.wikimedia.org/T406220 [08:00:19] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:01:48] !log dcausse@deploy2002 hamishz, dcausse: Continuing with sync [08:06:11] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193502|Allow AbuseFilter to block on ganwiki (T406220)]], [[gerrit:1193091|cirrus: test completion with default sort on simplewiki [1/3] (T404858)]] (duration: 12m 48s) [08:06:21] T406220: Allow block by abuse filter on Gan Wikipedia - https://phabricator.wikimedia.org/T406220 [08:06:21] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:07:08] (03PS1) 10Fabfur: Add ua_internals policy [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) [08:07:15] !log closing the UTC morning backport window [08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:30] (03CR) 10DCausse: "Please re-schedule this change, it was not deployed during the scheduled window (the person requesting the deploy did not show up during t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:09:45] !log installing OpenSSL security updates on trixie/bookworm [08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:21:41] (03CR) 10Jelto: [C:03+1] "should be good to test new thresholds" [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [08:29:44] (03PS1) 10Slyngshede: site.pp add new test host for CAS [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) [08:33:40] (03CR) 10Muehlenhoff: site.pp add new test host for CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:35:31] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Use http-after-response for headers [puppet] - 10https://gerrit.wikimedia.org/r/1193451 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:35:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:36:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [08:36:19] (03CR) 10FNegri: [C:03+1] P:toolforge: Move ru_monuments backwards compat redirect to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193448 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:40:21] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [08:40:26] (03CR) 10FNegri: P:toolforge: Move 
U-A/Referer blocks to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:40:57] (03CR) 10FNegri: [C:03+1] P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:42:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [08:46:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:34] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.10.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193786 [08:48:22] (03CR) 10FNegri: P:toolforge::proxy: Remove config moved to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:49:10] (03PS2) 10Slyngshede: site.pp add new test host for CAS [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) [08:49:18] (03CR) 10Majavah: P:toolforge: Move U-A/Referer blocks to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:49:24] (03PS1) 10Bartosz Wójtowicz: ml-services: Update docker image for article topic model on production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193787 (https://phabricator.wikimedia.org/T371021) [08:49:27] (03CR) 10Slyngshede: site.pp add new test host for CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:50:26] (03CR) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [08:51:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:46] (03CR) 10FNegri: [C:03+1] P:toolforge: Move U-A/Referer blocks to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:52:57] (03CR) 10Muehlenhoff: site.pp add new test host for CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [08:56:30] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-eqiad [09:00:39] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.10.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193786 (owner: 10Elukey) [09:01:40] (03PS1) 10Clément Goubert: mw-debug: Allow quick wipe and restart of deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193788 [09:02:05] (03PS1) 10Elukey: Upstream release v11.10.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1193789 [09:02:06] (03PS3) 10Slyngshede: site.pp add new test host for CAS [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) [09:02:20] (03CR) 10Elukey: [V:03+2 C:03+2] 
Upstream release v11.10.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1193789 (owner: 10Elukey) [09:02:29] (03CR) 10Slyngshede: site.pp add new test host for CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:02:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:04:57] (03PS1) 10Muehlenhoff: Clean up site.pp entries following installserver migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1193790 (https://phabricator.wikimedia.org/T396487) [09:09:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [09:12:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:14:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [09:18:34] !log uploaded spicerack_11.10.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [09:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:55] (03CR) 10Slyngshede: [C:03+2] site.pp add new test host for CAS [puppet] - 
10https://gerrit.wikimedia.org/r/1193784 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:22:18] (03CR) 10Slyngshede: [C:03+1] Clean up site.pp entries following installserver migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1193790 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:23:01] !log upgrade Envoy on schema* T403663 [09:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:04] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [09:25:24] (03CR) 10Lucas Werkmeister (WMDE): Change Portal talk namespace name for diqwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [09:26:22] (03PS1) 10MVernon: wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) [09:26:34] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [09:27:33] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [09:27:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [09:30:13] (03PS2) 10MVernon: wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) [09:30:24] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [09:33:57] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [09:35:03] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:57] (03PS1) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1193805 (https://phabricator.wikimedia.org/T406318) [09:53:08] (03PS3) 10MVernon: wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) [09:53:51] (03CR) 10Vgutierrez: Add ua_internals policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [09:54:13] (03CR) 10Elukey: [C:03+2] "Yeah we'd need to refactor it a bit, it is heavily Dell-based afaics :(" [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [09:54:15] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Use http-after-response for headers [puppet] - 10https://gerrit.wikimedia.org/r/1193451 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:54:24] (03CR) 10Majavah: [C:03+2] P:toolforge: Move ru_monuments backwards compat redirect to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193448 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:54:33] (03CR) 10Majavah: [C:03+2] P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:54:56] (03CR) 10MVernon: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [09:55:03] (03CR) 10Majavah: P:toolforge::proxy: Remove config moved to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:55:16] !log fetch haproxy 2.8.16 on thirdparty/haproxy28-bullseye (apt.wm.o) - T406451 [09:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] (03PS4) 10Majavah: P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) [09:55:51] (03PS5) 10Majavah: P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) [09:55:51] (03PS2) 10Majavah: P:toolforge::proxy: Remove config moved to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) [09:56:34] 06SRE, 10Observability-Metrics: Infrastructure-related Grafana dashboards should not be split by data center - https://phabricator.wikimedia.org/T406472 (10Tgr) 03NEW [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:58] (03PS1) 10Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) [09:58:31] (03PS2) 10Fabfur: haproxy: add ua_internals policy [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) [09:58:48] (03PS2) 10Elukey: osm: refactor swift scripts and make 
event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) [09:59:21] (03CR) 10Majavah: [C:03+2] P:toolforge: Move U-A/Referer blocks to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193449 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:59:57] (03CR) 10CI reject: [V:04-1] osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:00:04] (03CR) 10FNegri: [C:03+1] P:toolforge::proxy: Remove config moved to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1000) [10:00:07] !log upgrade to haproxy 2.8.16 on cp7008 and cp7016 - T406451 [10:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:31] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Remove config moved to HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193454 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [10:00:32] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp[7008,7016].magru.wmnet} and A:cp - 2.8.16 upgrade () [10:00:38] (03CR) 10Majavah: [C:03+2] P:toolforge: Move http redirect rewrite to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1193450 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [10:01:48] (03CR) 10Vgutierrez: [C:03+1] haproxy: add ua_internals policy [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [10:02:01] (03PS1) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1193808 
(https://phabricator.wikimedia.org/T406318) [10:02:03] (03PS1) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 1% [puppet] - 10https://gerrit.wikimedia.org/r/1193809 (https://phabricator.wikimedia.org/T406318) [10:02:05] (03PS1) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1193810 (https://phabricator.wikimedia.org/T406318) [10:02:07] (03PS1) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1193811 (https://phabricator.wikimedia.org/T406318) [10:06:47] (03PS1) 10Elukey: Add fake secrets for role::maps::master_bookworm [labs/private] - 10https://gerrit.wikimedia.org/r/1193812 [10:07:29] (03CR) 10Elukey: [V:03+2 C:03+2] Add fake secrets for role::maps::master_bookworm [labs/private] - 10https://gerrit.wikimedia.org/r/1193812 (owner: 10Elukey) [10:12:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp[7008,7016].magru.wmnet} and A:cp - 2.8.16 upgrade () [10:12:55] !log restarting spamassassin/clamav on VRTS to pick up OpenSSL updates [10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:56] (03CR) 10Ladsgroup: [C:03+1] instances.yaml: add es2051 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1193059 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:17:17] (03CR) 10Ladsgroup: [C:03+1] es2051.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1193058 (owner: 10Federico Ceratto) [10:17:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:24] (03CR) 10Krinkle: 
"Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [10:21:27] (03PS4) 10MVernon: wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) [10:21:27] (03CR) 10Federico Ceratto: [C:03+2] es2051.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1193058 (owner: 10Federico Ceratto) [10:21:31] (03PS9) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [10:21:34] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: add es2051 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1193059 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:21:46] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [10:21:51] (03PS3) 10Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) [10:22:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:54] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:26:19] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker-eqiad [10:27:03] jouncebot: nowandnext [10:27:03] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC 
mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1000) [10:27:03] In 1 hour(s) and 32 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1200) [10:33:00] (03PS4) 10Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) [10:33:27] (03CR) 10CI reject: [V:04-1] osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:33:50] !log restarting postfix to pick up openssl security updates [10:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:34] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:36:54] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp-test1005.wikimedia.org [10:36:56] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [10:36:57] (03PS5) 10Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) [10:39:13] !log upgrading to haproxy 2.8.16 on magru - T406451 [10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:19] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and not P{cp7008.magru.wmnet} and A:cp - 2.8.16 upgrade () [10:39:39] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2043.codfw.wmnet'] [10:39:48] !log vgutierrez@cumin1002 START - Cookbook 
sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and not P{cp7016.magru.wmnet} and A:cp - 2.8.16 upgrade () [10:39:56] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2043.codfw.wmnet'] [10:40:43] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test1005.wikimedia.org - slyngshede@cumin1003" [10:40:48] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test1005.wikimedia.org - slyngshede@cumin1003" [10:40:48] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:48] !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp-test1005.wikimedia.org on all recursors [10:40:51] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test1005.wikimedia.org on all recursors [10:41:00] !log upgraded spicerack to 11.10.0 on all cumin nodes [10:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:18] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:41:32] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:41:50] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:42:49] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:44:11] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:44:27] !log elukey@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet'] [10:50:38] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: fix ssd/storage corner cases [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) [10:51:51] (03CR) 10Elukey: [C:04-1] "Of course this is a stupid attempt, since I need to expose the right firmware versions first. sigh." [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [10:52:28] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 214657 [10:53:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 214657 [10:53:48] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045.codfw.wmnet'] [10:54:01] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2045.codfw.wmnet'] [10:54:08] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet'] [10:54:23] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2046.codfw.wmnet'] [10:54:31] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2047.codfw.wmnet'] [10:57:42] (03CR) 10CI reject: [V:04-1] sre.hardware.upgrade-firmware: fix ssd/storage corner cases [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [11:10:07] (03PS1) 10KartikMistry: cxserver: staging: Update to 2025-10-06-084053-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193821 (https://phabricator.wikimedia.org/T394982) [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:54] elukey@cumin2002 upgrade-firmware (PID 3869778) is awaiting input [11:15:13] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2047.codfw.wmnet'] [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:13] !log dropping interwiki table on group1 (T397367) [11:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:16] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [11:19:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and not P{cp7016.magru.wmnet} and A:cp - 2.8.16 upgrade () [11:22:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and not P{cp7008.magru.wmnet} and A:cp - 2.8.16 upgrade () [11:25:49] !log dropping interwiki table on group2 (T397367) [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:52] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [11:33:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11245757 (10Jclark-ctr) 
This server is currently out of warranty. It appears to be a non-RAID server. Could we pull a drive from a recently decommissioned se... [11:33:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11245759 (10Jclark-ctr) a:03Jclark-ctr [11:36:09] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406420#11245761 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:38:11] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for the Postfix Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193822 (https://phabricator.wikimedia.org/T135991) [11:38:54] (03PS4) 10Stevemunene: Define airflow-wikidata airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) [11:39:16] (03CR) 10Stevemunene: Define airflow-wikidata airflow instance (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:44:16] (03CR) 10Gkyziridis: [C:03+1] "LGTM! Thnx for deploying!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193787 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [11:48:29] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406367#11245791 (10Jclark-ctr) Rebalanced pdu. what uneven between BA. 
<-> AA #1: Phase, BA:L2-L3, Active Power; Value: 1515 (power) high: 1475 [11:49:01] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406367#11245794 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:53:05] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update docker image for article topic model on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193787 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [11:53:06] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11245801 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:53:23] reminder: a gerrit switchover will happen in a moment [11:55:18] (03Merged) 10jenkins-bot: ml-services: Update docker image for article topic model on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193787 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [11:57:30] what happens to the gate-and-submit queue during gerrit readonly? [11:57:43] I guess Zuul will fail to report the success, and the changes will have to be retried afterwards? 
[11:57:57] (03CR) 10Hnowlan: [C:03+1] mw-debug: Allow quick wipe and restart of deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193788 (owner: 10Clément Goubert) [12:00:04] arnaudb and hashar: Deploy window Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1200) [12:02:47] (03CR) 10Arnaudb: [C:03+2] gerrit: switchover from gerrit1003 to gerrit2003 [dns] - 10https://gerrit.wikimedia.org/r/1193082 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:02:53] (03CR) 10Arnaudb: [C:03+2] gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [12:03:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1193701 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.312s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:18] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:04:22] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:04:30] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:05:04] !log arnaudb@dns1004 START - running authdns-update [12:05:28] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org [12:05:48] !log arnaudb@cumin1003 END (PASS) 
- Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org [12:07:22] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org [12:07:25] !log stopped CI Jenkins [12:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:30] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit2003.wikimedia.org [12:08:28] !log upgrade Envoy on yarn/turnilo hosts T403663 [12:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:31] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [12:11:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:58] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [12:13:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:18] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:14:06] PROBLEM - check if authdns-update was run after a change was merged to 
operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:14:11] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:18] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:14:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:02] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on 
dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:30] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:30] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:17:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:18:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.257s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:20:25] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.failover (exit_code=99) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:20:32] Suppose all of these are gerrit switchover related (except the MediaWikiLatencyExceeded one which can safely be ignored) [12:21:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:40] claime: the authdns-update ones for sure [12:22:12] the docker-reporter one I am not sure. Maybe cause it needs to clone/fetch operations/docker-images/production-images [12:22:14] I'm gonna silence authdns [12:22:19] thx! [12:22:26] claime: already did that [12:22:32] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:22:36] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:22:37] sobanski: ah, it still showed for me [12:23:00] codfw mw-parsoid latency,I have no idea :) [12:23:26] sobanski: deleted my silence, only yours left [12:23:36] hashar: there's basically no more traffic to this [12:23:39] Thanks [12:23:41] Only things left can take a while [12:23:46] So it skews averages [12:25:09] Lucas_WMDE: re the gate-and-submit queue: Zuul keeps the queue and jobs run on Jenkins. But the jobs will fail to clone/fetch from Gerrit and thus fail. 
For the ones that were ongoing, Zuul will fail to report back to Gerrit so they will appear unprocessed [12:25:26] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:25:37] Lucas_WMDE: I have stopped jenkins so that when Gerrit is back I can bring it up and Zuul will be able to resume the jobs showing "queued" on https://integration.wikimedia.org/zuul/ [12:25:44] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org [12:25:49] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org [12:25:52] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org [12:25:57] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit2003.wikimedia.org [12:25:58] (Puppet has brought up Jenkins and that caused some jobs to fail) [12:26:15] so https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1193408 would need a new +2 [12:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:26:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:27] and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1193824 will need a recheck [12:27:19] !log arnaudb@cumin1003 END (ERROR) - Cookbook sre.gerrit.failover 
(exit_code=97) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:28:37] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:28:51] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org [12:28:54] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org [12:28:58] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org [12:29:02] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit2003.wikimedia.org [12:29:14] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.failover (exit_code=99) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.576s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:36:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.576s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:37:23] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:37:56] (03CR) 10Arnaudb: [V:03+2 C:03+2] Revert "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193826 (owner: 10Arnaudb) [12:38:14] !log arnaudb@dns1004 START - running authdns-update [12:38:50] (03CR) 10Arnaudb: [V:03+2 C:03+2] "CI not seeing this, bypassing as per @hashar@free.fr agreement" [puppet] - 10https://gerrit.wikimedia.org/r/1193827 (owner: 10Arnaudb) [12:38:58] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [12:39:11] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:20] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:39:46] !log arnaudb@dns1004 END - running authdns-update [12:39:56] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:09] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt sretest1005 - jclark@cumin1002" [12:41:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt sretest1005 - jclark@cumin1002" [12:41:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:30] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:30] RECOVERY - check if 
authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:41:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:42:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:42:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:42:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:43:05] (03PS1) 10Hashar: gerrit: drop /srv/gerrit/plugins [puppet] - 10https://gerrit.wikimedia.org/r/1193832 [12:43:18] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:43:18] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:43:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local 
zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:43:46] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest1005 [12:43:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest1005 [12:44:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:44:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:44:44] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:44:57] (03CR) 10Hashar: "Once merged, one has to manually drop the empty directory on Gerrit hosts: `rmdir /srv/gerrit/plugins`" [puppet] - 10https://gerrit.wikimedia.org/r/1193832 (owner: 10Hashar) [12:46:20] We ran into issues with the cookbook / cumin before the Gerrit instances were switched over. 
All the changes have now been reverted and we'll try again after investigating [12:48:33] (03PS1) 10Muehlenhoff: Add missing Cumin alias for cloudrabbit/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1193836 [12:48:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt sretest1005 - jclark@cumin1002" [12:48:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt sretest1005 - jclark@cumin1002" [12:48:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:49:56] should CI jobs be running for new patches pushed to gerrit now? [12:50:25] I'm not seeing CI jobs for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1193829 [12:50:58] kostajh: `recheck` it [12:51:05] gerrit was down [12:51:11] :-] [12:51:33] hashar: I did a rebase 4 minutes ago, after Gerrit was up again [12:51:42] which should have triggered the jobs [12:51:59] https://integration.wikimedia.org/zuul/#q=1193829 [12:52:06] it is not there [12:52:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11245918 (10Jclark-ctr) @Marostegui I see both drives now. where you able to re-sync md0 ? ` jclark@dbproxy1022:~$ lsblk -a NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS loop0... [12:52:40] https://integration.wikimedia.org/zuul/ is fairly empty [12:53:59] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11245930 (10Jclark-ctr) Service request was Denied due to Server out of warranty. Was able to call Dell and open Parts only claim. 
using 1 year limited warranty claim [12:54:18] grr connection limits of doom [12:55:07] !log Restarting Zuul. Deadlocked due to zombie connections with Gerrit [12:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:23] (03CR) 10Cappybaraa: Change Portal talk namespace name for diqwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [12:56:41] kostajh: I have restarted Zuul to clear some zombie connections. Your change is processing now: https://integration.wikimedia.org/zuul/#q=1193829 [12:56:45] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11245950 (10Ladsgroup) That's dbproxy1022? [12:56:45] thanks for the report [12:56:56] hashar: thanks! [12:57:15] (that is a bug somewhere in Zuul low level code. I once thought I had a fix for it and eventually gave up) [12:57:50] (03CR) 10Fabfur: haproxy: add ua_internals policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [12:58:04] (03CR) 10Majavah: Add missing Cumin alias for cloudrabbit/codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193836 (owner: 10Muehlenhoff) [12:59:37] (03CR) 10Fabfur: [C:03+2] haproxy: add ua_internals policy [puppet] - 10https://gerrit.wikimedia.org/r/1193782 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1300). [13:00:05] mfossati and Cappybaraa: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:09] o/ [13:01:46] hashar: “But the jobs will fail to clone/fetch from Gerrit and thus fail.” – I thought reading from gerrit was supposed to work throughout? does that not include cloning? [13:01:49] anyway, thanks for handling it :) [13:02:11] are we okay to start with the backport+config window? [13:02:17] hello! [13:02:54] Lucas_WMDE: ok for me [13:03:07] just wondering what the status of the gerrit migration is [13:03:18] (I was afk for lunch while it was happening, trying to read up on it now ^^) [13:04:10] right, good point [13:05:20] AFAICT gerrit is writable again and gate-and-submit builds can at least be started [13:05:37] I assume it’s okay to deploy [13:05:45] mfossati: want to deploy your own config change or should I do it? [13:06:04] I can do that [13:06:06] ok! [13:06:25] let's go! [13:07:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [13:07:06] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-debug: Allow quick wipe and restart of deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193788 (owner: 10Clément Goubert) [13:07:53] (03Merged) 10jenkins-bot: ReaderExperiments' ImageBrowsing: use edge uniques [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [13:08:14] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1192578|ReaderExperiments' ImageBrowsing: use edge uniques (T403259)]] [13:08:17] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:10:04] (03PS1) 10Federico Ceratto: migrate.py: MariaDB version migration cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [13:11:11] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['cp2048.codfw.wmnet'] [13:11:26] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [13:11:33] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [13:11:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [13:12:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:12:15] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11245978 (10Jclark-ctr) Ahh i also see whats wrong with dbproxy1024. Can I clear preserved cache on Raid controller? STOR305: Unable to complete the operation because preserved cache present on... 
[13:12:57] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet'] [13:13:38] (03CR) 10Ssingh: [C:03+1] url_downloader: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1193701 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [13:13:46] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2050.codfw.wmnet'] [13:14:41] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2051.codfw.wmnet'] [13:14:53] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2051.codfw.wmnet'] [13:14:57] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1192578|ReaderExperiments' ImageBrowsing: use edge uniques (T403259)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:14:59] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2052.codfw.wmnet'] [13:15:00] T403259: Instrument image browsing interactions - https://phabricator.wikimedia.org/T403259 [13:15:25] !log mfossati@deploy2002 mfossati: Continuing with sync [13:18:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11245989 (10Marostegui) >>! In T405804#11245978, @Jclark-ctr wrote: > Ahh i also see whats wrong with dbproxy1024. Can I clear preserved cache on Raid controller? > > > STOR305: Unable to compl... 
[13:18:50] (03PS2) 10Muehlenhoff: Clean up site.pp entries following installserver migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1193790 (https://phabricator.wikimedia.org/T396487) [13:19:03] (03PS1) 10Arnaudb: Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 [13:19:19] (03PS1) 10Arnaudb: Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193846 [13:19:47] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192578|ReaderExperiments' ImageBrowsing: use edge uniques (T403259)]] (duration: 11m 32s) [13:19:49] (03CR) 10Lucas Werkmeister (WMDE): "Please join `#wikimedia-operations` on [Libera Chat](https://libera.chat/) so we can start the deployment soon :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [13:20:33] Cappybaraa is online but only in -releng so far, apparently [13:20:53] Lucas_WMDE: all done here [13:20:58] thanks! [13:21:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11245994 (10Jclark-ctr) It wouldn’t let me create a VD for the replacement drive because of preserved cache on the controller from the failed drive. I talked to Amir via IRC, cleared the cache, and... 
[13:21:06] let’s see if the other person shows up [13:21:37] Lucas_WMDE: I have a patch to deploy whenever you're done [13:23:01] kostajh: right now there’s nothing to do, so feel free to deploy imho [13:23:07] hopefully Cappybaraa will still show up [13:23:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11246016 (10Ladsgroup) I try to add it [13:24:40] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 2.8.16 upgrade () [13:24:43] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 2.8.16 upgrade () [13:25:12] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11246034 (10Ladsgroup) In progress: ` ladsgroup@dbproxy1024:~$ cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sda2[1]... [13:25:23] (03CR) 10Muehlenhoff: [C:03+2] Clean up site.pp entries following installserver migration to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1193790 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [13:29:22] (03CR) 10Ssingh: [C:03+2] site.pp and preseed.yaml: add hcaptcha VMs [puppet] - 10https://gerrit.wikimedia.org/r/1193423 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [13:29:53] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker[1004-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [13:33:57] kostajh: do you want to deploy? 
[13:34:05] cappybaraa seems to be having IRC issues :/ [13:34:12] Lucas_WMDE: yes, just waiting for CI to finish up on the master branch patch [13:34:15] ah ok [13:34:55] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:35:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11246078 (10Marostegui) Next time it is probably easier just to reimage as these hosts have no data at all [13:35:58] Lucas_WMDE: mind if I deploy a config patch? [13:37:00] cdanis: kostajh had a patch to deploy [13:37:07] although, if that’s a backport, maybe you can squeeze in the config change first [13:37:13] cdanis: you can go ahead, I can go after you [13:37:17] ok [13:37:29] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:39:09] Hi! Could someone please deploy my change https://gerrit.wikimedia.org/r/c/MediaWiki/extensions/Scribunto/+/1191845 ? Thanks! 
[13:39:18] I just started mine Lucas_WMDE [13:39:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192950 [13:39:30] cappybaraa: we can do your config change after cdanis is done [13:39:45] okay, thanks [13:39:57] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]] [13:40:01] T304373: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 [13:40:08] I think there should be enough time for it [13:40:23] jouncebot: nowandnext [13:40:24] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1300) [13:40:24] In 0 hour(s) and 49 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1430) [13:40:25] kostajh’s change might run over a bit, but there’s half an hour between the end of this window and the start of the next one [13:40:46] CI for code backports takes way too long :( [13:42:09] Lucas_WMDE: can we merge this one please? 
That test failed again on the patch I'm waiting to merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1190236 [13:43:07] * Lucas_WMDE looks at https://phabricator.wikimedia.org/T388228 and sighs a bit [13:43:22] ok so the error you’ve seen again is *not* the one that I tried to fix in the attached gerrit change [13:43:43] but rather the one where I have no clue why it would be tied to this specific test [13:44:24] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha1001.wikimedia.org [13:44:25] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:44:49] (03PS1) 10Kosta Harlan: UserInfoCard: Limit who can view past blocks and remove redundant data points [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) [13:45:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) (owner: 10Kosta Harlan) [13:45:16] cdanis: let me know when you're done, please [13:45:19] will do [13:45:37] we're at 7 minutes and it hasn't finished testservers yet heh [13:45:38] (03CR) 10Dreamy Jazz: [C:03+1] UserInfoCard: Limit who can view past blocks and remove redundant data points [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) (owner: 10Kosta Harlan) [13:45:46] kostajh: cappybaraa’s config change is up first, please [13:45:52] oh, ok [13:46:06] * Lucas_WMDE looks at the backport [13:46:17] can we sync them together? 
[13:46:35] !log cdanis@deploy2002 cdanis, otto: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:46:36] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha2001.wikimedia.org [13:46:39] T304373: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 [13:46:44] I guess we could just deploy them together [13:46:51] yeah [13:46:55] !log cdanis@deploy2002 cdanis, otto: Continuing with sync [13:46:55] looks harmless enough [13:47:54] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:47:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha1001.wikimedia.org - sukhe@cumin1003" [13:47:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:59] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha1001.wikimedia.org on all recursors [13:48:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha1001.wikimedia.org on all recursors [13:48:23] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:49:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) (owner: 10Kosta Harlan) [13:51:35] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2053.codfw.wmnet'] [13:51:49] !log elukey@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2053.codfw.wmnet'] [13:51:57] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2054.codfw.wmnet'] [13:52:09] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha2001.wikimedia.org - sukhe@cumin1003" [13:52:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2054.codfw.wmnet'] [13:52:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha2001.wikimedia.org - sukhe@cumin1003" [13:52:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:13] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha2001.wikimedia.org on all recursors [13:52:16] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha2001.wikimedia.org on all recursors [13:52:17] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2055.codfw.wmnet'] [13:52:21] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]] (duration: 12m 24s) [13:52:23] (03CR) 10Kosta Harlan: Change Portal talk namespace name for diqwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [13:52:25] T304373: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 [13:52:55] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
hcaptcha2001.wikimedia.org - sukhe@cumin1003" [13:52:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha2001.wikimedia.org - sukhe@cumin1003" [13:53:43] cdanis: can I continue deploying? [13:53:57] (03CR) 10Clément Goubert: [C:03+1] "Note that this will change the stat prefixes for the ex-restbase routes, since the route group was named `mw-api-int`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [13:54:37] (03CR) 10Clément Goubert: [C:03+2] mw-debug: Allow quick wipe and restart of deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193788 (owner: 10Clément Goubert) [13:54:43] (03CR) 10Clément Goubert: [C:03+2] deployment_server: Add optional scap-clean-images systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) (owner: 10Ahmon Dancy) [13:55:01] Lucas_WMDE: please [13:55:12] ok thanks! 
[13:55:18] (03PS1) 10Muehlenhoff: Add Cumin aliases for new Zuul hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193862 [13:55:20] (there’s no rush, the backport will need like 7 more minutes in CI anyway) [13:55:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [13:55:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) (owner: 10Kosta Harlan) [13:56:00] sukhe@cumin1003 makevm (PID 1523514) is awaiting input [13:56:23] (03Merged) 10jenkins-bot: Change Portal talk namespace name for diqwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [13:56:40] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2001.wikimedia.org with OS trixie [13:57:09] (03Merged) 10jenkins-bot: mw-debug: Allow quick wipe and restart of deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193788 (owner: 10Clément Goubert) [13:57:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:57:41] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:57:52] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:58:04] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:58:11] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:58:19] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:58:59] claime: is that hopefully going to speed up the “stuck at 75%” phase of scap deploying to the test servers? 
[13:59:10] (just curious / checking if I understand it correctly ^^) [13:59:12] Lucas_WMDE: yes [13:59:14] nice [13:59:38] Basically giving the permission to k8s to not care about these pod availabilities, and just destroy and recreate [13:59:44] yeah, makes sense [14:00:00] should be quick enough that it does not alert or anything [14:00:01] (though I wouldn’t be surprised if there’s still occasional bug reports about it ^^) [14:00:02] but we'll see [14:00:33] btw, I didn't know about that "stuck at 75%" issue until this morning [14:00:52] https://bash.toolforge.org/quip/AVs7pBmZQMK9DA-FLCNR [14:01:10] Lucas_WMDE: yeah, basically this [14:01:43] (03CR) 10Jcrespo: [C:04-1] "I will advance for now on a GitLab proposal that covers both datacenters." [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [14:02:28] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11246214 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:03:21] (03Merged) 10jenkins-bot: UserInfoCard: Limit who can view past blocks and remove redundant data points [extensions/CheckUser] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193859 (https://phabricator.wikimedia.org/T406480) (owner: 10Kosta Harlan) [14:03:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1191861|Change Portal talk namespace name for diqwiki (T328207)]], [[gerrit:1193859|UserInfoCard: Limit who can view past blocks and remove redundant data points (T406480)]] [14:03:51] T328207: Change Namespace Aliases on diq.wikipedia - https://phabricator.wikimedia.org/T328207 [14:03:51] T406480: Limit who can view past blocks and remove redundant data points - https://phabricator.wikimedia.org/T406480 [14:04:13] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2052.codfw.wmnet'] 
[14:04:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 2.8.16 upgrade () [14:06:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kharlan, cappybaraa: Backport for [[gerrit:1191861|Change Portal talk namespace name for diqwiki (T328207)]], [[gerrit:1193859|UserInfoCard: Limit who can view past blocks and remove redundant data points (T406480)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:06:01] (03CR) 10Elukey: [C:03+1] provision: ensure CSMSupport is enabled in MBR mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1193511 (owner: 10JHathaway) [14:06:12] (03PS1) 10Jelto: gitlab: add check for object storage credentials [puppet] - 10https://gerrit.wikimedia.org/r/1193864 (https://phabricator.wikimedia.org/T406234) [14:06:36] lol from 4 minutes to 24 seconds [14:06:39] looking on mwdebug [14:06:44] * Lucas_WMDE looks at the last deploy [14:06:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 2.8.16 upgrade () [14:06:58] yeah [14:07:00] wait [14:07:03] ok but I feel like the 4 minutes must’ve been an outlier [14:07:04] containers aren't up [14:07:08] Lucas_WMDE: it wasn't [14:07:11] I don’t remember it taking *that* long [14:07:12] really? 
[14:07:36] so it kinda makes it racey to do this [14:07:43] Lucas_WMDE: lgtm [14:07:57] > 16:01 *** cappybaraa (~cappybara@user/cappybaraa) has quit (Quit: Client closed) [14:08:01] -.- [14:08:04] Lucas_WMDE: https://logstash.wikimedia.org/goto/cf9ee42ba2c8ec0bef57ee1d5afe0f37 [14:08:06] look at the urls [14:08:13] s/urls/durations/ [14:08:17] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7202/console" [puppet] - 10https://gerrit.wikimedia.org/r/1193864 (https://phabricator.wikimedia.org/T406234) (owner: 10Jelto) [14:08:22] Brain's wonky today [14:08:32] (03CR) 10Mooeypoo: mediawiki-engineering: Add REST API alerts with thresholds (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [14:09:00] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11246249 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:09:09] yeah so, setting it up like this makes it very fast, but it also makes scap not wait long enough I think [14:09:14] thanks, my IRC client chose the perfect moment to hang up, sending me wondering which URLs I was supposed to be looking at :D [14:09:19] hm ok [14:09:23] anyway I better go test cappybaraa’s change [14:09:34] because the helmfile call probably returns immediately, not waiting for any container to be up [14:09:35] elukey@cumin2002 upgrade-firmware (PID 3957564) is awaiting input [14:09:52] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: add check for object storage credentials [puppet] - 10https://gerrit.wikimedia.org/r/1193864 (https://phabricator.wikimedia.org/T406234) (owner: 10Jelto) [14:10:32] lgtm [14:10:39] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kharlan, cappybaraa: Continuing with sync [14:13:07] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 
2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage [14:13:35] elukey@cumin2002 upgrade-firmware (PID 3957564) is awaiting input [14:15:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191861|Change Portal talk namespace name for diqwiki (T328207)]], [[gerrit:1193859|UserInfoCard: Limit who can view past blocks and remove redundant data points (T406480)]] (duration: 11m 31s) [14:15:21] T328207: Change Namespace Aliases on diq.wikipedia - https://phabricator.wikimedia.org/T328207 [14:15:21] T406480: Limit who can view past blocks and remove redundant data points - https://phabricator.wikimedia.org/T406480 [14:17:36] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: namespaceDupes diqwiki --fix # T328207 [14:18:19] (03CR) 10CDanis: [C:03+2] WMF-Uniq -> analytics: better stats & privacy [puppet] - 10https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [14:18:21] (03CR) 10CDanis: [C:03+2] benthos: switch to new & improved wmfuniq fields [puppet] - 10https://gerrit.wikimedia.org/r/1192576 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [14:19:02] !log UTC afternoon backport+config window done [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage [14:24:30] So from what I can tell from scap code, it would actually wait to have 2 available replicas in the new replicaset, but since minReadySeconds is 0, the pods are considered available as soon as they are created [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1430) [14:34:32] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.16 upgrade () 
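Note on the 14:24:30 message about minReadySeconds: a minimal sketch of the availability rule it describes, assuming the usual Kubernetes semantics that a pod counts as available once it has been Ready for at least minReadySeconds. The threshold of 2 available replicas comes from that message. The `Pod` type and the function names are hypothetical illustrations, not scap or Kubernetes code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Pod:
    """Hypothetical stand-in for a pod; not a real Kubernetes client object."""
    name: str
    ready_since: Optional[float]  # timestamp (seconds) when the pod became Ready, None if not Ready


def is_available(pod: Pod, now: float, min_ready_seconds: float) -> bool:
    # A pod that is not Ready is never available.
    if pod.ready_since is None:
        return False
    # Available once it has stayed Ready for at least min_ready_seconds.
    return now - pod.ready_since >= min_ready_seconds


def available_replicas(pods: Iterable[Pod], now: float, min_ready_seconds: float) -> int:
    return sum(1 for p in pods if is_available(p, now, min_ready_seconds))


def rollout_satisfied(pods: Iterable[Pod], now: float,
                      min_ready_seconds: float, wanted: int = 2) -> bool:
    # Assumption: the deploy tool waits until `wanted` replicas are available.
    return available_replicas(pods, now, min_ready_seconds) >= wanted


if __name__ == "__main__":
    pods = [Pod("new-rs-a", 100.0), Pod("new-rs-b", 100.5)]
    # With min_ready_seconds=0 the wait ends almost immediately.
    print(rollout_satisfied(pods, now=101.0, min_ready_seconds=0))
    # With a non-zero value the same pods are not yet counted.
    print(rollout_satisfied(pods, now=101.0, min_ready_seconds=10))
```

Under this model, a value of 0 lets the wait condition pass the moment the new pods report Ready, which would match the "from 4 minutes to 24 seconds" observation and the concern that scap does not wait long enough.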
[14:34:36] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.16 upgrade () [14:34:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2001.wikimedia.org with OS trixie [14:34:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha2001.wikimedia.org [14:35:51] !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha2002.wikimedia.org [14:35:52] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [14:36:49] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2055.codfw.wmnet'] [14:37:10] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2056.codfw.wmnet'] [14:37:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11246359 (10cmooney) p:05Triage→03Medium [14:37:15] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2056.codfw.wmnet'] [14:37:36] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2057.codfw.wmnet'] [14:37:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11246363 (10cmooney) p:05Triage→03Low [14:37:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2057.codfw.wmnet'] [14:37:56] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2058.codfw.wmnet'] [14:38:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['cp2058.codfw.wmnet'] [14:39:21] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha2002.wikimedia.org - sukhe@cumin1003" [14:39:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy1028.eqiad.wmnet with reason: Maintenance [14:39:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha2002.wikimedia.org - sukhe@cumin1003" [14:39:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:59] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha2002.wikimedia.org on all recursors [14:40:02] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha2002.wikimedia.org on all recursors [14:40:34] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha2002.wikimedia.org - sukhe@cumin1003" [14:40:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha2002.wikimedia.org - sukhe@cumin1003" [14:40:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:41:03] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [14:42:23] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:42:34] jouncebot: nowandnext [14:42:35] For the next 0 hour(s) and 17 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1430) [14:42:35] In 0 hour(s) and 47 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1530) [14:42:51] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2002.wikimedia.org with OS trixie [14:43:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11246391 (10Marostegui) @robh I can only talk about dbproxy* hosts - they are both standby (as of today) so you can proceed whenever yo... [14:46:32] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193878 (https://phabricator.wikimedia.org/T396578) [14:48:04] 06SRE, 06Infrastructure-Foundations: move human users out of UID range for system accounts - https://phabricator.wikimedia.org/T114446#11246430 (10LSobanski) 05Open→03Declined This one falls into "too much effort to fix and while technically breaking a rule doesn't cause issues that are worth fixing it" [14:48:22] 06SRE, 06Infrastructure-Foundations, 06Traffic: Updating Netbox for LVS hosts in eqiad lvs10(1[789]|20) - https://phabricator.wikimedia.org/T334884#11246433 (10cmooney) @ssingh I think we can probbaly close this one? Currently the hosts are representing reality (even if the process could be improved), and g... 
[14:50:29] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11246438 (10RobH) > Dear customer, > > We are standing in front of your rack and we see that the equipment you mentioned is not labeled, so we need more details in order to disconnect the cables. > Below are photos of... [14:50:53] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [14:51:00] 06SRE, 06Infrastructure-Foundations, 06Traffic: Updating Netbox for LVS hosts in eqiad lvs10(1[789]|20) - https://phabricator.wikimedia.org/T334884#11246439 (10ssingh) 05Open→03Resolved a:03ssingh Thanks for following up @cmooney. Other than the fact that we owed you a response on your last comment... [14:51:08] hu disregard this plz [14:51:30] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.failover (exit_code=99) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [14:54:36] 06SRE, 06Infrastructure-Foundations: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112#11246469 (10MoritzMuehlenhoff) 05Open→03Declined ISC has stopped development of dhcpd, so we will be migrating to ISC Kea and also won't be adding a systemd unit anymore [14:55:22] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [14:55:27] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1005.wikimedia.org - slyngshede@cumin1003" [14:55:57] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test1005.wikimedia.org - slyngshede@cumin1003" [14:56:31] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test1005.wikimedia.org with OS trixie [14:57:02] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:57:39] (03PS3) 10Hnowlan: rest-gateway: use mw-api-ext rather than mw-api-int for all APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) [14:57:47] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:57:55] (03PS3) 10Elukey: [DNM] provision: remove some idrac10 cpu settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1185057 [14:58:05] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:58:05] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host hcaptcha1001.wikimedia.org [14:58:06] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage [14:58:26] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:03:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage [15:04:19] (03PS1) 10Clément Goubert: gateway-check: Introduce regex matching [puppet] - 10https://gerrit.wikimedia.org/r/1193882 (https://phabricator.wikimedia.org/T406318) [15:06:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:06:18] !log installing libcpanel-json-xs-perl security updates [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:24] !log installing libxslt security updates [15:08:25] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:13:00] (03CR) 10MVernon: "Tested to make sure it's actually a no-op (PCC doesn't recalculate the facts) thus (newlines added to try and make it a bit more readable)" [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [15:14:33] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 2.8.16 upgrade () [15:15:08] elukey@cumin2002 provision (PID 3991946) is awaiting input [15:18:00] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:18:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 2.8.16 upgrade () [15:19:50] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2002.wikimedia.org with OS trixie [15:19:50] !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha2002.wikimedia.org [15:21:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with 
Dell SCP reboot policy FORCED [15:21:51] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193883 (https://phabricator.wikimedia.org/T128546) [15:22:36] (03CR) 10Aude: [C:03+1] Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [15:23:39] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246587 (10Prototyperspective) 05Stalled→03Open >>! In T405760#11231467, @Aklapper wrote: > What does the load chart in the network tab of your browser's developer tools show? Is there... [15:24:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:25:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11246593 (10elukey) I was able to update the firmware cookbook for IDRAC 10, and now we can do idrac+bios (still working on some issue with ssd, should be solve... [15:27:18] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1005.wikimedia.org with reason: host reimage [15:29:47] (03CR) 10BCornwall: [C:03+1] varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [15:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1530). [15:31:48] o/ deploying portal banner [15:31:57] (03CR) 10Hnowlan: "Added!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193389 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [15:32:40] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1005.wikimedia.org with reason: host reimage [15:33:29] (03PS1) 10Ottomata: sqoop - Add centralauth globaluser and localuser tables [puppet] - 10https://gerrit.wikimedia.org/r/1193886 (https://phabricator.wikimedia.org/T389666) [15:34:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:08] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193883 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:38:53] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193883 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add es2051 T402859', diff saved to https://phabricator.wikimedia.org/P83607 and previous config saved to /var/cache/conftool/dbconfig/20251006-153927-fceratto.json [15:39:31] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [15:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:02] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1193886 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [15:46:41] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1005.wikimedia.org with OS trixie [15:46:41] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test1005.wikimedia.org [15:46:49] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11246699 (10RobH) > The three patches have now been removed as requested. [15:47:24] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11246703 (10RobH) 05Open→03Resolved [15:47:41] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11246706 (10Jelto) In my opinion no downtime is needed for `gitlab-runners`. If you want to depool the runners this can be done using `depool` command on... [15:48:03] 10ops-drmrs: drmrs: remove old lvs secondary links - https://phabricator.wikimedia.org/T396603#11246709 (10RobH) All patches removed by remote hands, no downtime noted (so they pulled the correct cables) and the patches have been deleted out of netbox. 
[15:49:22] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246716 (10Jdlrobson-WMF) [15:53:47] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1193796| Bumping portals to master (T128546)]] (duration: 08m 59s) [15:53:51] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:55:11] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2051 gradually with 4 steps - Pooling in new host [15:55:48] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1193796| Bumping portals to master (T128546)]] (duration: 01m 59s) [15:58:21] (03PS1) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) [15:59:58] (03CR) 10Ottomata: [C:03+2] sqoop - Add centralauth globaluser and localuser tables [puppet] - 10https://gerrit.wikimedia.org/r/1193886 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [16:00:41] (03CR) 10JHathaway: [C:03+2] provision: ensure CSMSupport is enabled in MBR mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1193511 (owner: 10JHathaway) [16:01:50] (03CR) 10Elukey: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [16:04:50] (03CR) 10Elukey: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [16:05:47] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:06:00] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet 
with chassis set policy FORCE_RESTART [16:06:45] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:08:58] (03Merged) 10jenkins-bot: provision: ensure CSMSupport is enabled in MBR mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1193511 (owner: 10JHathaway) [16:15:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11246838 (10Jhancock.wm) cp2050/52: try logging into root and using the racadm command. That's one of the symptoms when it needs to be used. cp2056: this is th... [16:17:12] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:19:55] (03CR) 10Elukey: profile::thanos: fix xlab SLI's recording rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [16:22:25] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm [16:22:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#11246868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host mc-misc2001.codfw.wmnet with OS bookworm [16:30:39] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.16 upgrade () [16:32:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.16 upgrade () [16:33:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.667s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:34:33] (03PS2) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) [16:35:18] (03PS3) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) [16:35:50] (03CR) 10Elukey: "All right I think I found a good-enough solution, lemme know!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [16:36:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:43] (03PS1) 10Ottomata: sqoop - fix name of hive centralauth tables [puppet] - 10https://gerrit.wikimedia.org/r/1193901 (https://phabricator.wikimedia.org/T389666) [16:38:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.259s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:39:34] (03CR) 10Joal: [C:03+1] sqoop - fix name of hive centralauth tables [puppet] - 10https://gerrit.wikimedia.org/r/1193901 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [16:40:18] (03CR) 10Ottomata: [C:03+2] sqoop - fix name of hive centralauth 
tables [puppet] - 10https://gerrit.wikimedia.org/r/1193901 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [16:40:39] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2051 gradually with 4 steps - Pooling in new host [16:42:12] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246977 (10bvibber) If we believe this is a networking issue I don't think Reader Growth Team can really do anything about it; this would be something you'd have to... [16:42:25] (03PS1) 10Clément Goubert: gateway-check: Group-based routing approach [puppet] - 10https://gerrit.wikimedia.org/r/1193903 (https://phabricator.wikimedia.org/T406318) [16:43:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.197s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:44:11] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:44:43] !log otto@deploy2002 Started deploy [analytics/refinery@21fe78f]: deploying analytics/refinery to an-launcher1002 to pick up change for T389666 [16:44:47] T389666: NEW/CHANGE FEATURE REQUEST: make available the centralauth.globaluser table in Data Lake - https://phabricator.wikimedia.org/T389666 [16:45:18] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246986 (10Prototyperspective) Maybe I 
shouldn't have written that I also use a VPN. **The site is exempt from the VPN. I'm not using a VPN for it.** Because of the... [16:46:33] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246990 (10CDanis) @Prototyperspective Can you please post the output of the traceroute step in [[ https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivit... [16:46:35] !log otto@deploy2002 Finished deploy [analytics/refinery@21fe78f]: deploying analytics/refinery to an-launcher1002 to pick up change for T389666 (duration: 02m 11s) [16:47:10] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11246992 (10bvibber) Yeah, that definitely sounds like a network problem, and needs to be resolved between SRE and your internet service provider. [16:48:10] (03PS1) 10CDanis: turnilo: new wmfuniq fields [puppet] - 10https://gerrit.wikimedia.org/r/1193904 (https://phabricator.wikimedia.org/T405783) [16:48:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.197s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:54:21] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11247021 (10Prototyperspective) Okay, thanks for identifying the responsible team and for the link to the instructions on how to report this. I'll create a separate s... 
[16:56:54] 06SRE, 06Commons, 06Reader Growth Team, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11247050 (10bvibber) Oh you're on the right task! I just think we don't need to add Readers Growth Team to it. :D [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T1700). [17:00:22] (03CR) 10Dzahn: [C:03+2] Add Cumin aliases for new Zuul hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193862 (owner: 10Muehlenhoff) [17:04:01] (03PS2) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) [17:05:12] (03PS3) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) [17:07:18] (03CR) 10VolkerE: [C:03+1] Remove old, unused ArticleSummaries Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193447 (https://phabricator.wikimedia.org/T406361) (owner: 10LorenMora) [17:08:31] (03CR) 10Jasmine: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1186006 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [17:12:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:12:19] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2051 - 
Depooling host [17:12:58] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2051 - Depooling host [17:13:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2051 gradually with 4 steps - Pooling in new host [17:15:18] (03CR) 10Majavah: [C:04-1] toolforge: wheel-of-misfortune: Exclude sshd-session (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) (owner: 10BryanDavis) [17:17:57] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 2.8.16 upgrade () [17:19:26] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 2.8.16 upgrade () [17:25:53] (03PS2) 10Esanders: DiscussionTools: enable thanking comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122638 (https://phabricator.wikimedia.org/T366095) (owner: 10DLynch) [17:29:36] !log jasmine@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: decom [17:31:14] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:34:31] (03PS1) 10DDesouza: Increase coverage of Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193911 (https://phabricator.wikimedia.org/T405577) [17:34:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193911 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [17:36:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - 
https://phabricator.wikimedia.org/T392851#11247151 (10Jhancock.wm) @elukey give 2056 a shot when you get on next. might have fixed it. if not i might have to get dell involved again. [17:39:57] (03PS2) 10DDesouza: Increase coverage of Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193911 (https://phabricator.wikimedia.org/T405577) [17:41:52] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:42:42] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11247155 (10Aklapper) [17:42:42] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc-misc2001.codfw.wmnet with OS bookworm [17:44:14] (03PS1) 10Scott French: mw-debug: revert to default maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193915 [17:44:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 2.8.16 upgrade () [17:44:17] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 2.8.16 upgrade () [17:51:46] (03PS1) 10Bking: opensearch-test: Raise default pod memory allocation (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193916 (https://phabricator.wikimedia.org/T397246) [17:59:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2051 gradually with 4 steps - Pooling in new host [18:00:21] (03CR) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) (owner: 10BryanDavis) [18:07:38] (03CR) 10RLazarus: [C:03+1] "Great commit message, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193915 (owner: 10Scott French) [18:19:06] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11247240 (10MKopec) Hello @Dzahn, is there anything else needed to proceed? [18:22:18] (03CR) 10Btullis: [C:03+1] opensearch-test: Raise default pod memory allocation (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193916 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [18:26:35] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly apart from the usual whitespace errors." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1193538 (https://phabricator.wikimedia.org/T404134) (owner: 10Pppery) [18:26:41] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11247298 (10Dzahn) @MKopec It looks like you have provided everything needed. It's just that the "clinic duty" that handles access request changes every week. And just last week the sche... [18:27:31] (03CR) 10JHathaway: sre.hardware.upgrade-firmware: fix ssd upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [18:28:11] (03CR) 10Scott French: "Thanks for the review, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193915 (owner: 10Scott French) [18:28:16] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11247308 (10MKopec) Ok, thank you! 
[18:30:10] (03PS2) 10Ahmon Dancy: Allow deployment group to sudo -u mwbuilder scap clean-images [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) [18:30:17] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) (owner: 10Ahmon Dancy) [18:35:25] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1192567/5080/" [puppet] - 10https://gerrit.wikimedia.org/r/1192567 (https://phabricator.wikimedia.org/T387927) (owner: 10Ahmon Dancy) [18:36:49] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 2.8.16 upgrade () [18:36:54] (03CR) 10Scott French: [C:03+2] mw-debug: revert to default maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193915 (owner: 10Scott French) [18:38:34] (03Merged) 10jenkins-bot: mw-debug: revert to default maxUnavailable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193915 (owner: 10Scott French) [18:40:32] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 2.8.16 upgrade () [18:42:05] (03PS1) 10Andrew Bogott: Add trixie-specific manifests for Openstack version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1193921 [18:45:50] (03CR) 10Ahmon Dancy: "Scott, if you're okay with the traindev-staging name as-is, can I ask you to merge this? 
I can do it myself but I know that you have a pro" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [18:48:37] (03PS2) 10Andrew Bogott: Add trixie-specific manifests for Openstack version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1193921 (https://phabricator.wikimedia.org/T406516) [18:50:06] (03CR) 10Ahmon Dancy: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [18:51:03] (03CR) 10Majavah: [C:04-1] toolforge: wheel-of-misfortune: Exclude sshd-session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) (owner: 10BryanDavis) [18:51:37] (03CR) 10Andrew Bogott: [C:03+2] Add trixie-specific manifests for Openstack version Epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1193921 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [18:53:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS trixie [18:54:16] jouncebot: nowandnext [18:54:20] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [18:54:20] In 1 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T2000) [18:54:57] (03PS1) 10Samtar: ext.wikimediaEvents.WatchlistBaseline: Add page-visited [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193923 (https://phabricator.wikimedia.org/T401575) [18:55:30] (03PS1) 10MusikAnimal: WishRenderer: short-circuit and show error if proposer is invalid [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193924 (https://phabricator.wikimedia.org/T406194) [18:56:31] (03PS1) 10Ottomata: sqoop - fix centralauth job by using seperate script and adding it to sqoop-whole-mediawiki [puppet] - 
10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) [18:58:50] (03CR) 10CI reject: [V:04-1] sqoop - fix centralauth job by using seperate script and adding it to sqoop-whole-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [19:00:20] (03PS2) 10Ottomata: sqoop - fix centralauth - use seperate script and add to sqoop-whole-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) [19:01:30] !incidents [19:01:31] No incidents occurred in the past 24 hours for team SRE [19:01:46] oncall handover: everything quiet during the shift [19:01:52] thanks [19:02:03] * swfrench-wmf thumbs up [19:04:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193924 (https://phabricator.wikimedia.org/T406194) (owner: 10MusikAnimal) [19:04:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:05:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:05:35] (03Merged) 10jenkins-bot: WishRenderer: short-circuit and show error if proposer is invalid [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193924 (https://phabricator.wikimedia.org/T406194) (owner: 10MusikAnimal) [19:05:44] (03CR) 10Bking: [C:03+2] opensearch-test: Raise default pod memory allocation (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193916 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:06:01] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1193924|WishRenderer: short-circuit and show error if proposer is invalid (T406194)]] [19:06:04] T406194: InvalidArgumentException: Wishes must have a proposer! - https://phabricator.wikimedia.org/T406194 [19:06:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [19:07:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[19:07:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.16 upgrade () [19:07:39] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.16 upgrade () [19:09:59] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [19:10:40] (03CR) 10BCornwall: [C:03+2] varnish: misc VTC quality of life improvements [puppet] - 10https://gerrit.wikimedia.org/r/1193287 (owner: 10Krinkle) [19:13:09] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193878 (https://phabricator.wikimedia.org/T396578) (owner: 10Santiago Faci) [19:13:24] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [19:14:32] musikanimal: can you give me a ping when you've finished deploying that CommunityRequests patch please? :) [19:14:39] sure thing! [19:15:05] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.6 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193878 (https://phabricator.wikimedia.org/T396578) (owner: 10Santiago Faci) [19:20:17] (03CR) 10Scott French: [C:03+1] "No objections to proceeding as is. While I would have personally named it differently, I don't really have a strong preference." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [19:20:46] (03CR) 10Scott French: [C:03+2] Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [19:21:43] I don't know if something is wrong with the above deployment or not. 
It's been stuck on "Started build-and-push-container-images" for 13+ minutes, which is much longer than normal in my very limited experience as a deployer [19:22:17] musikanimal: `540 languages rebuilt out of 540` [19:22:28] musikanimal: Looks like your backport touched localisation files..... yeah that. [19:22:39] (03Merged) 10jenkins-bot: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [19:22:43] ah, I see. Okay, so no cause for alarm [19:22:49] thanks :) [19:23:00] Just keep twiddling your thumbs [19:23:14] lol [19:24:19] Thanks again Scott! [19:24:24] musikanimal: in this case, should probably be ~ 20-30 mins for a build, followed by ~ 20 mins for the deployment itself [19:24:31] dancy: no problem! [19:25:12] (03PS1) 10SBassett: Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) [19:25:46] oh dear. That might leak into the UTC late backport window :/ [19:25:59] (03CR) 10SBassett: [C:04-2] "For this Thursday's (2025-10-09) late backport window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [19:26:00] but now I know to give time for this is i18n files are touched [19:26:07] *if [19:26:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [19:32:06] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1193924|WishRenderer: short-circuit and show error if proposer is invalid (T406194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [19:32:21] T406194: InvalidArgumentException: Wishes must have a proposer! - https://phabricator.wikimedia.org/T406194 [19:33:07] !log musikanimal@deploy2002 musikanimal: Continuing with sync [19:39:45] (03CR) 10Dzahn: "is it" [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [19:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:41:12] (03PS2) 10CDanis: turnilo: new wmfuniq fields [puppet] - 10https://gerrit.wikimedia.org/r/1193904 (https://phabricator.wikimedia.org/T405783) [19:41:34] (03CR) 10CDanis: [C:03+2] turnilo: new wmfuniq fields [puppet] - 10https://gerrit.wikimedia.org/r/1193904 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [19:45:01] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193924|WishRenderer: short-circuit and show error if proposer is invalid (T406194)]] (duration: 39m 00s) [19:45:17] TheresNoTime: done, finally! [19:45:25] musikanimal: thank you! :) [19:45:44] T406194: InvalidArgumentException: Wishes must have a proposer! - https://phabricator.wikimedia.org/T406194 [19:46:48] going to quickly(tm) deploy something before the window [19:47:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193923 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [19:49:08] (03PS1) 10CDanis: turnilo: fix field name typo [puppet] - 10https://gerrit.wikimedia.org/r/1193930 [19:49:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{dse-k8s-worker[1004-1019].eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [19:49:22] (03CR) 10Dzahn: [C:04-1] "we want this..after all.. 
going to add secret to private repo" [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:49:36] (03CR) 10CDanis: [C:03+2] turnilo: fix field name typo [puppet] - 10https://gerrit.wikimedia.org/r/1193930 (owner: 10CDanis) [19:53:30] (03PS1) 10Scott French: Introduce output DSL rendering for known_client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193931 [19:54:18] RECOVERY - MD RAID on dbproxy1024 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:55:22] (03Merged) 10jenkins-bot: ext.wikimediaEvents.WatchlistBaseline: Add page-visited [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193923 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [19:55:40] (03CR) 10Scott French: [V:03+2] "Tested locally at `3a03c76`." [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193931 (owner: 10Scott French) [19:55:42] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1193923|ext.wikimediaEvents.WatchlistBaseline: Add page-visited (T401575)]] [19:55:54] (03CR) 10Scott French: [V:03+2 C:03+2] Introduce output DSL rendering for known_client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1193931 (owner: 10Scott French) [19:56:30] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [19:56:37] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Introduce output DSL rendering for known_client objects - swfrench@cumin2002" [19:56:39] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Introduce output DSL rendering for known_client objects - swfrench@cumin2002 [19:57:30] !log 
swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Introduce output DSL rendering for known_client objects - swfrench@cumin2002 [19:57:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Introduce output DSL rendering for known_client objects - swfrench@cumin2002" [19:58:09] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 2.8.16 upgrade () [19:59:07] (03PS3) 10Dzahn: zuul: adjust config section for zuul auth operator [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (https://phabricator.wikimedia.org/T395938) [19:59:49] fyi deployers, my deployment is going to overrun into the window slightly [20:00:03] !log samtar@deploy2002 samtar: Backport for [[gerrit:1193923|ext.wikimediaEvents.WatchlistBaseline: Add page-visited (T401575)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T2000). [20:00:05] arlolra and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:10] TheresNoTime: ok [20:00:12] o/ [20:00:29] not a problem [20:02:55] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 2.8.16 upgrade () [20:03:16] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:03:22] !log samtar@deploy2002 samtar: Continuing with sync [20:03:39] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:04:01] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host mc-misc2001.codfw.wmnet with OS bookworm [20:04:11] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#11247671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host mc-misc2001.codfw.wmnet with OS bookworm [20:06:39] is anyone else able to run the deployment window after my deploy is done? [20:07:09] I can deploy my patch, hes [20:07:21] I can deploy mine as well [20:07:33] ack [20:07:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983#11247674 (10VRiley-WMF) a:03VRiley-WMF [20:09:55] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193923|ext.wikimediaEvents.WatchlistBaseline: Add page-visited (T401575)]] (duration: 14m 13s) [20:10:18] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [20:10:25] arlolra: danisztls: done ^ :) [20:11:14] thanks [20:11:20] danisztls: do you mind if I start? 
[20:11:53] arlolra: no [20:12:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193179 (https://phabricator.wikimedia.org/T406250) (owner: 10Arlolra) [20:13:04] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 26 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193179 (https://phabricator.wikimedia.org/T406250) (owner: 10Arlolra) [20:13:23] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1193179|Deploy Parsoid Read Views to 26 Wikipedias (T406250)]] [20:13:30] T406250: Parsoid Read Views to Wikipedia deploy ~2025-10-06 - https://phabricator.wikimedia.org/T406250 [20:14:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:12] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage [20:17:37] (03PS1) 10Jdlrobson: tempUserBanner: Set `relative` position to enable `z-index` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193932 (https://phabricator.wikimedia.org/T404122) [20:17:48] (03PS2) 10Jdlrobson: tempUserBanner: Set `relative` position to enable `z-index` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193932 (https://phabricator.wikimedia.org/T404122) [20:18:53] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1193179|Deploy Parsoid Read Views to 26 Wikipedias (T406250)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:19:07] T406250: Parsoid Read Views to Wikipedia deploy ~2025-10-06 - https://phabricator.wikimedia.org/T406250 [20:19:46] !log arlolra@deploy2002 arlolra: Continuing with sync [20:20:35] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-misc2001.codfw.wmnet with reason: host reimage [20:24:07] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193179|Deploy Parsoid Read Views to 26 Wikipedias (T406250)]] (duration: 10m 43s) [20:24:10] T406250: Parsoid Read Views to Wikipedia deploy ~2025-10-06 - https://phabricator.wikimedia.org/T406250 [20:24:46] danisztls: all yours [20:24:50] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406524 (10phaultfinder) 03NEW [20:24:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193575 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:24:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193911 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:25:00] arlolra: thansk! 
[20:25:03] *thanks [20:25:49] (03Merged) 10jenkins-bot: Undeploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193575 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:25:53] (03Merged) 10jenkins-bot: Increase coverage of Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193911 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [20:26:13] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1193575|Undeploy reader foundational survey on enwiki (T405410)]], [[gerrit:1193911|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]] [20:26:20] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:26:20] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:30:34] !log dani@deploy2002 dani: Backport for [[gerrit:1193575|Undeploy reader foundational survey on enwiki (T405410)]], [[gerrit:1193911|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:31:27] !log dani@deploy2002 dani: Continuing with sync [20:35:50] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193575|Undeploy reader foundational survey on enwiki (T405410)]], [[gerrit:1193911|Increase coverage of Design Research participant recruitment survey on jawiki (T405577)]] (duration: 09m 37s) [20:35:56] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:35:56] T405577: Deploy QuickSurvey for research participant registration drive on jawiki - https://phabricator.wikimedia.org/T405577 [20:36:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:51] all done [20:39:41] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [20:40:05] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [20:40:06] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-misc2001.codfw.wmnet with OS bookworm [20:40:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#11247785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host mc-misc2001.codfw.wmnet with OS bookworm completed: - mc-misc2001 (**P... [20:41:20] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#11247788 (10Jhancock.wm) 05Open→03Resolved @MoritzMuehlenhoff This server is finally ready to be put back in. Thank you for your patience. 
[20:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:50:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 3 others: decommission druid100[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T403801#11247792 (10VRiley-WMF) 05Open→03Resolved [20:50:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 3 others: decommission druid100[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T403801#11247794 (10VRiley-WMF) This has been completed [20:51:25] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [20:51:30] (03PS1) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [20:51:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11247796 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm ex... [20:52:41] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:56:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.16 upgrade () [20:56:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.16 upgrade () [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T2100) [21:01:46] (03CR) 10VolkerE: [C:03+1] tempUserBanner: Set `relative` position to enable `z-index` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193932 (https://phabricator.wikimedia.org/T404122) (owner: 10Jdlrobson) [21:02:40] (03CR) 10RLazarus: [C:03+2] kubernetes: Set default Envoy version to 1.29.12 [puppet] - 10https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663) (owner: 10RLazarus) [21:03:41] (03PS2) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:05:07] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:06:51] (03PS3) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:08:17] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:10:48] (03PS4) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) [21:11:18] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:11:45] (03CR) 10BryanDavis: toolforge: wheel-of-misfortune: Exclude sshd-session (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193891 (https://phabricator.wikimedia.org/T406504) (owner: 10BryanDavis) [21:12:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:14:15] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:15:19] Hey all - one sec patch to get out today... [21:15:30] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11247834 (10Krinkle) [21:21:04] (03PS1) 10Mstyles: OATHAuth: Increase 2FA opt-in to 40% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) [21:21:19] jhancock@cumin1002 provision (PID 3218240) is awaiting input [21:22:42] (03CR) 10SBassett: [C:03+1] OATHAuth: Increase 2FA opt-in to 40% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [21:23:28] (03PS1) 10Ryan Kemper: wdqs: Fix 3 federation endpoint URLs [puppet] - 10https://gerrit.wikimedia.org/r/1193942 (https://phabricator.wikimedia.org/T402905) [21:24:46] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:07] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:26:46] (03CR) 10Bking: [C:03+1] wdqs: Fix 3 federation endpoint URLs [puppet] - 10https://gerrit.wikimedia.org/r/1193942 (https://phabricator.wikimedia.org/T402905) (owner: 10Ryan Kemper) [21:26:49] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Fix 3 federation endpoint URLs [puppet] - 10https://gerrit.wikimedia.org/r/1193942 (https://phabricator.wikimedia.org/T402905) (owner: 10Ryan Kemper) [21:28:11] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [21:28:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11247892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [21:29:44] !log Deployed security mitigation for T251032 [21:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:32:54] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:35:17] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 2.8.16 upgrade () [21:36:24] (03CR) 10JHathaway: [C:03+1] Enable profile::auto_restarts::service for the Postfix Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193822 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [21:36:45] (03PS1) 10Bking: wdqs: Add soon-to-be-reimaged hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1193943 
(https://phabricator.wikimedia.org/T405978) [21:37:20] (03PS10) 10Krinkle: varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 [21:37:20] (03PS12) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [21:38:28] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 2.8.16 upgrade () [21:38:33] (03PS1) 10Btullis: spark-operator: Update RBAC rules for job namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193944 (https://phabricator.wikimedia.org/T405490) [21:39:13] (03CR) 10Ryan Kemper: [C:03+1] wdqs: Add soon-to-be-reimaged hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1193943 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [21:39:30] (03CR) 10Bking: [C:03+2] wdqs: Add soon-to-be-reimaged hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1193943 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [21:43:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye [21:43:52] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [21:44:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [21:47:15] (03CR) 10Catrope: [C:03+1] OATHAuth: Increase 2FA opt-in to 40% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193941 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [21:50:38] jhancock@cumin1002 reimage (PID 3235535) is awaiting input [21:59:52] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [22:00:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 
13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11247946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm ex... [22:02:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:05:07] (03CR) 10Bking: [C:03+1] spark-operator: Update RBAC rules for job namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193944 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [22:05:56] (03PS2) 10Scott French: P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1192616 (https://phabricator.wikimedia.org/T403220) [22:05:56] (03PS2) 10Scott French: P:conftool::hiddenparma: enable known_client_expression_validation [puppet] - 10https://gerrit.wikimedia.org/r/1192620 (https://phabricator.wikimedia.org/T403220) [22:06:38] (03CR) 10Btullis: [C:03+2] spark-operator: Update RBAC rules for job namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193944 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [22:07:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:04] (03Merged) 10jenkins-bot: spark-operator: Update RBAC rules for job namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193944 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [22:16:10] FIRING: BFDdown: 
BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:17:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and Init7 (2001:1620:1000::85) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:21:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:22:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:23:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[22:23:43] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:23:47] (03PS1) 10Jclark-ctr: add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 [22:26:11] (03CR) 10CI reject: [V:04-1] add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (owner: 10Jclark-ctr) [22:30:31] (03CR) 10Jclark-ctr: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (owner: 10Jclark-ctr) [22:32:58] (03PS2) 10Jclark-ctr: add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 [22:34:47] ryankemper@cumin2002 reimage (PID 896) is awaiting input [22:35:21] (03CR) 10CI reject: [V:04-1] add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (owner: 10Jclark-ctr) [22:36:30] bking@cumin2002 reimage (PID 1331) is awaiting input [22:37:27] (03PS3) 10Jclark-ctr: add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (https://phabricator.wikimedia.org/T400661) [22:37:32] (03CR) 10Jclark-ctr: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (https://phabricator.wikimedia.org/T400661) (owner: 10Jclark-ctr) [22:38:08] (03PS1) 10Btullis: Revert "spark-operator: Update RBAC rules for job namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193948 [22:38:32] (03CR) 10RobH: [C:03+2] add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (https://phabricator.wikimedia.org/T400661) (owner: 10Jclark-ctr) [22:38:52] (03PS2) 10Clément Goubert: preseed: Set UEFI preseed for wikikube-ctrl2006 [puppet] - 10https://gerrit.wikimedia.org/r/1193075 (https://phabricator.wikimedia.org/T400661) [22:39:16] (03CR) 10RobH: [C:03+2] preseed: Set UEFI preseed for wikikube-ctrl2006 [puppet] - 10https://gerrit.wikimedia.org/r/1193075 
(https://phabricator.wikimedia.org/T400661) (owner: 10Clément Goubert) [22:41:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/0/5 (Core: ssw1-d8-eqiad:ethernet-1/32 {#B00392}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:42:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:42:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and Init7 (2001:1620:1000::85) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:42:47] (03Abandoned) 10Jclark-ctr: add wikikube-ctrl2006 for efi booting to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1193947 (https://phabricator.wikimedia.org/T400661) (owner: 10Jclark-ctr) [22:43:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [22:48:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1020.eqiad.wmnet with OS bullseye [22:49:07] (03PS4) 10Dzahn: zuul: adjust config section for zuul auth operator [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (https://phabricator.wikimedia.org/T395938) [22:49:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1020.eqiad.wmnet with OS bullseye [22:57:11] (03Abandoned) 10Btullis: Revert "spark-operator: Update RBAC rules for job namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193948 (owner: 10Btullis) [22:57:48] Can I start the web deploy window early? 
[22:59:49] (03CR) 10Santiago Faci: Add ReadingList Stream to EventStreamConfig (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251006T2300) [23:02:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193447 (https://phabricator.wikimedia.org/T406361) (owner: 10LorenMora) [23:02:59] (03Merged) 10jenkins-bot: Remove old, unused ArticleSummaries Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193447 (https://phabricator.wikimedia.org/T406361) (owner: 10LorenMora) [23:03:21] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1193447|Remove old, unused ArticleSummaries Stream (T406361)]] [23:03:24] T406361: Remove ArticleSummaries Stream from mediawiki-config - https://phabricator.wikimedia.org/T406361 [23:03:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2017.codfw.wmnet with OS bullseye [23:03:47] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [23:04:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11248078 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [23:06:22] (03PS1) 10Btullis: Allow the spark-operator controller to communicate with driver pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193951 (https://phabricator.wikimedia.org/T405490) [23:07:51] !log jdlrobson@deploy2002 jdlrobson, lmora: Backport for [[gerrit:1193447|Remove old, unused ArticleSummaries Stream (T406361)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:08:34] !log jdlrobson@deploy2002 jdlrobson, lmora: Continuing with sync [23:13:08] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193447|Remove old, unused ArticleSummaries Stream (T406361)]] (duration: 09m 47s) [23:13:12] T406361: Remove ArticleSummaries Stream from mediawiki-config - https://phabricator.wikimedia.org/T406361 [23:17:21] (03CR) 10Btullis: [C:03+2] Allow the spark-operator controller to communicate with driver pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193951 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [23:17:23] (testing) [23:19:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193932 (https://phabricator.wikimedia.org/T404122) (owner: 10Jdlrobson) [23:19:28] syncing [23:23:18] (03PS1) 10Dzahn: add fake secret for zuul auth operator [labs/private] - 10https://gerrit.wikimedia.org/r/1193952 (https://phabricator.wikimedia.org/T395938) [23:23:31] (03Merged) 10jenkins-bot: tempUserBanner: Set `relative` position to enable `z-index` [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193932 (https://phabricator.wikimedia.org/T404122) (owner: 10Jdlrobson) [23:23:54] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1193932|tempUserBanner: Set `relative` position to enable `z-index` (T404122)]] [23:23:57] T404122: Short term fix for temp accounts z-index blocker - https://phabricator.wikimedia.org/T404122 [23:24:04] (03CR) 10Dzahn: [V:03+2 C:03+2] add fake secret for zuul auth operator [labs/private] - 10https://gerrit.wikimedia.org/r/1193952 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:25:09] (03Merged) 10jenkins-bot: Allow the spark-operator controller to communicate with driver pods [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1193951 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [23:28:02] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1193932|tempUserBanner: Set `relative` position to enable `z-index` (T404122)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:28:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [23:29:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [23:30:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1192617/7208/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192617 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:30:56] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [23:35:24] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193932|tempUserBanner: Set `relative` position to enable `z-index` (T404122)]] (duration: 11m 30s) [23:35:28] T404122: Short term fix for temp accounts z-index blocker - https://phabricator.wikimedia.org/T404122 [23:36:43] ok done :) [23:37:28] (03PS1) 10Dzahn: zuul: tighten file mode for new zuul config file [puppet] - 10https://gerrit.wikimedia.org/r/1193954 (https://phabricator.wikimedia.org/T395938) [23:38:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193955 [23:38:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193955 (owner: 10TrainBranchBot) [23:39:53] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:40:26] bking@cumin2002 reimage (PID 34369) is awaiting input [23:41:56] (03CR) 10BCornwall: [C:03+2] 
varnish: Refactor 08-mobile vtc to pair req/resp assertions [puppet] - 10https://gerrit.wikimedia.org/r/1193285 (owner: 10Krinkle) [23:46:07] (03CR) 10Dzahn: [C:03+2] zuul: tighten file mode for new zuul config file [puppet] - 10https://gerrit.wikimedia.org/r/1193954 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:52:20] jhancock@cumin1002 reimage (PID 3410860) is awaiting input [23:53:34] (03CR) 10BryanDavis: osm_master: Store kartotherian and tegola passwords (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [23:53:38] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193955 (owner: 10TrainBranchBot) [23:58:23] (03PS1) 10Dzahn: zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) [23:58:39] (03CR) 10CI reject: [V:04-1] zuul: reduce code duplication for new zuul setup [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)