[00:18:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:17] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:34:47] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:39:00] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241920 [00:39:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241920 (owner: 10TrainBranchBot) [00:53:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1241920 (owner: 10TrainBranchBot) [01:00:17] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:05:17] RESOLVED: [2x] ProbeDown: Service wdqs2014:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241936 [01:08:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241936 (owner: 10TrainBranchBot) [01:32:37] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1241936 (owner: 10TrainBranchBot) [02:00:50] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:40] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 50s) [02:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:28:22] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:21] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:32:20] (03PS3) 10Anzx: Lift IP cap for Editathon on commonswiki, eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241057 (https://phabricator.wikimedia.org/T417830) [03:32:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241057 (https://phabricator.wikimedia.org/T417830) (owner: 10Anzx) [03:56:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T415786)', diff saved to https://phabricator.wikimedia.org/P88970 and previous config saved to /var/cache/conftool/dbconfig/20260223-035619-marostegui.json [03:56:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [04:11:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P88971 and previous config saved to /var/cache/conftool/dbconfig/20260223-041128-marostegui.json [04:26:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P88972 and previous config saved to /var/cache/conftool/dbconfig/20260223-042636-marostegui.json [04:39:09] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers 
already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#11639026 (10aaron) I can't see how this can happen witho... [04:41:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T415786)', diff saved to https://phabricator.wikimedia.org/P88973 and previous config saved to /var/cache/conftool/dbconfig/20260223-044144-marostegui.json [04:41:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [04:42:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2246.codfw.wmnet with reason: Maintenance [04:42:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T415786)', diff saved to https://phabricator.wikimedia.org/P88974 and previous config saved to /var/cache/conftool/dbconfig/20260223-044209-marostegui.json [06:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:23:41] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1242167 (https://phabricator.wikimedia.org/T418079) [06:24:25] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1242168 (https://phabricator.wikimedia.org/T418080) [06:24:31] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1242169 (https://phabricator.wikimedia.org/T418080) [07:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T415786)', diff saved to https://phabricator.wikimedia.org/P88975 and 
previous config saved to /var/cache/conftool/dbconfig/20260223-072119-marostegui.json [07:21:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:31:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:36:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P88976 and previous config saved to /var/cache/conftool/dbconfig/20260223-073627-marostegui.json [07:39:39] (03CR) 10Muehlenhoff: [C:03+2] civicrm: Run spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240889 (owner: 10Muehlenhoff) [07:44:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:48:33] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::base_repo [puppet] - 10https://gerrit.wikimedia.org/r/1240856 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:49:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:50:14] (03CR) 10Jcrespo: [C:04-1] "This wouldn't work as is, still, and requires pupetization. 
Code is useful to have as a reference, but non-trivial amount of engineering t" [puppet] - 10https://gerrit.wikimedia.org/r/375349 (https://phabricator.wikimedia.org/T59617) (owner: 10Jcrespo) [07:51:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P88977 and previous config saved to /var/cache/conftool/dbconfig/20260223-075135-marostegui.json [07:53:15] (03PS1) 10Filippo Giunchedi: pontoon: bump default instance spec [puppet] - 10https://gerrit.wikimedia.org/r/1242227 [07:56:27] (03CR) 10Muehlenhoff: [C:03+2] cassandra: Run spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240717 (owner: 10Muehlenhoff) [07:56:34] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: bump default instance spec [puppet] - 10https://gerrit.wikimedia.org/r/1242227 (owner: 10Filippo Giunchedi) [08:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T0800). [08:00:05] Msz2001, matthiasmullie, anzx, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:16] o/ [08:00:24] o/ [08:00:43] o/ [08:00:55] I'm a deployer and I can start deploying the patches [08:01:10] thanks! [08:01:35] I think I'll start with all the config ones at once, is it okay dcausse and anzx (hopefully they are here as well)? 
[08:02:27] Msz2001: perfectly fine, mine should only affect a maint job [08:04:00] Anzx doesn't appear to be here yet, so I'll skip their patch for now [08:04:15] o/ [08:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240892 (https://phabricator.wikimedia.org/T417877) (owner: 10Mszwarc) [08:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:04:34] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster class and related classes [puppet] - 10https://gerrit.wikimedia.org/r/1240853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:05:13] (03Merged) 10jenkins-bot: Ensure that sysops don't have '(oathauth-recover-for-user)' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240892 (https://phabricator.wikimedia.org/T417877) (owner: 10Mszwarc) [08:05:17] (03Merged) 10jenkins-bot: cirrus: enable default_sort for completion on a set of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207758 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [08:05:46] Msz2001: hi, anzx patch to change IP throttle can be merged, there is no testing involved beside the unit tests that run for throttles :) [08:06:07] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1240892|Ensure that sysops don't have '(oathauth-recover-for-user)' right (T417877)]], [[gerrit:1207758|cirrus: enable default_sort for completion on a set of wikis (T404858)]] [08:06:08] Okay, I can do it later, then [08:06:12] T417877: Document new rights and configuration in OATHAuth related to 2FA enforcement and grant the rights by default - https://phabricator.wikimedia.org/T417877 [08:06:13] T404858: A/B test using defaultsort with the completion suggester - 
https://phabricator.wikimedia.org/T404858 [08:06:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T415786)', diff saved to https://phabricator.wikimedia.org/P88978 and previous config saved to /var/cache/conftool/dbconfig/20260223-080644-marostegui.json [08:06:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:06:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [08:06:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T415786)', diff saved to https://phabricator.wikimedia.org/P88979 and previous config saved to /var/cache/conftool/dbconfig/20260223-080657-marostegui.json [08:09:47] (03PS2) 10Elukey: profile::puppetserver: rework and fix the analytics-sre config [puppet] - 10https://gerrit.wikimedia.org/r/1241012 (https://phabricator.wikimedia.org/T402512) [08:10:15] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1241012 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:16:53] (03PS5) 10Arnaudb: gerrit: add mtail monitoring on replication [puppet] - 10https://gerrit.wikimedia.org/r/1238315 (https://phabricator.wikimedia.org/T418084) [08:17:10] (03CR) 10Arnaudb: "replication metrics were not reflecting the replication issue we had while the plugin did not reload properly" [puppet] - 10https://gerrit.wikimedia.org/r/1238315 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [08:22:34] (03CR) 10Hashar: [C:03+1] Revert^2 "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [08:26:46] (03CR) 10Jcrespo: [C:03+1] installserver: use EFI booting for new apus frontends [puppet] - 10https://gerrit.wikimedia.org/r/1241010 (https://phabricator.wikimedia.org/T416387) (owner: 10MVernon) 
[08:27:25] (03CR) 10Muehlenhoff: [C:03+2] Move the puppetmaster puppetdb client class under puppet_compiler [puppet] - 10https://gerrit.wikimedia.org/r/1240278 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:27:48] 07Puppet, 06collaboration-services, 10Gerrit, 13Patch-For-Review: Gerrit git replication should not break when Puppet changes its config - https://phabricator.wikimedia.org/T416929#11639182 (10hashar) The short fix is to [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238043 | disable configuratio... [08:28:01] (03CR) 10MVernon: [C:03+2] installserver: use EFI booting for new apus frontends [puppet] - 10https://gerrit.wikimedia.org/r/1241010 (https://phabricator.wikimedia.org/T416387) (owner: 10MVernon) [08:29:23] (03CR) 10Hashar: [C:03+1] "This can be deployed, Gerrit will need a restart in order for the config to be applied." [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [08:29:43] !log mszwarc@deploy2002 mszwarc, dcausse: Backport for [[gerrit:1240892|Ensure that sysops don't have '(oathauth-recover-for-user)' right (T417877)]], [[gerrit:1207758|cirrus: enable default_sort for completion on a set of wikis (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:29:48] T417877: Document new rights and configuration in OATHAuth related to 2FA enforcement and grant the rights by default - https://phabricator.wikimedia.org/T417877 [08:29:49] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:29:54] dcausse: Do you have anything to test? 
[08:30:13] Msz2001: no, it can't be tested [08:30:22] ok, continuing [08:30:27] !log mszwarc@deploy2002 mszwarc, dcausse: Continuing with sync [08:32:08] (03PS1) 10Kosta Harlan: IPoid: Retry on intermittent network errors in OpenSearch fetcher [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242253 (https://phabricator.wikimedia.org/T417908) [08:32:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242253 (https://phabricator.wikimedia.org/T417908) (owner: 10Kosta Harlan) [08:32:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240769 (https://phabricator.wikimedia.org/T417910) (owner: 10Kosta Harlan) [08:33:25] I've added two patches to the window, please ping me when it's my turn :) [08:35:07] kostajh: we still have two other patches to go, but I can deploy yours together with some others [08:35:08] (03PS1) 10Kosta Harlan: HCaptchaEnterpriseHealthChecker: Use a cache hit for health check [extensions/ConfirmEdit] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242254 (https://phabricator.wikimedia.org/T412947) [08:35:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242254 (https://phabricator.wikimedia.org/T412947) (owner: 10Kosta Harlan) [08:35:36] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 [08:36:17] Msz2001: the IPReputation patch and 
config patch can go out with something else, sure https://gerrit.wikimedia.org/r/c/1242253/ https://gerrit.wikimedia.org/r/c/1240769/ [08:36:32] the hCaptcha patch I'd like to deploy separately, I can do that at the end of the window https://gerrit.wikimedia.org/r/c/1242254/ [08:36:53] ack [08:40:25] (03Abandoned) 10Muehlenhoff: Revert "svg: refuse to generate SVGs larger than a particular size" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [08:40:42] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 (owner: 10Muehlenhoff) [08:40:55] (03CR) 10Muehlenhoff: [C:03+2] Remove buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1239673 (owner: 10Muehlenhoff) [08:42:01] matthiasmullie: Are you okay if I deploy your patch together with Kosta's and Anzx's patches? [08:42:06] sure! [08:42:20] okay, I'll do it in a sec, the previous deployment is about to finish [08:42:37] (03CR) 10Mszwarc: [C:03+2] "Approving manually to speed up deployment" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241870 (owner: 10Matthias Mullie) [08:42:39] (03CR) 10Mszwarc: [C:03+2] "Approving manually to speed up deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240769 (https://phabricator.wikimedia.org/T417910) (owner: 10Kosta Harlan) [08:42:41] (03CR) 10Mszwarc: [C:03+2] "Approving manually to speed up deployment" [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242253 (https://phabricator.wikimedia.org/T417908) (owner: 10Kosta Harlan) [08:42:43] (03CR) 10Mszwarc: [C:03+2] "Approving manually to speed up deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241057 (https://phabricator.wikimedia.org/T417830) (owner: 10Anzx) [08:43:14] !log mszwarc@deploy2002 Finished scap sync-world: Backport for 
[[gerrit:1240892|Ensure that sysops don't have '(oathauth-recover-for-user)' right (T417877)]], [[gerrit:1207758|cirrus: enable default_sort for completion on a set of wikis (T404858)]] (duration: 37m 07s) [08:43:20] T417877: Document new rights and configuration in OATHAuth related to 2FA enforcement and grant the rights by default - https://phabricator.wikimedia.org/T417877 [08:43:21] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [08:43:29] (03PS1) 10Majavah: admin: Remove buster_ssh_keys logic [puppet] - 10https://gerrit.wikimedia.org/r/1242257 [08:43:43] (03Merged) 10jenkins-bot: IPReputation: Lower IPoid request and connect timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240769 (https://phabricator.wikimedia.org/T417910) (owner: 10Kosta Harlan) [08:43:47] (03Merged) 10jenkins-bot: Lift IP cap for Editathon on commonswiki, eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241057 (https://phabricator.wikimedia.org/T417830) (owner: 10Anzx) [08:43:52] Msz2001: thanks for the deploy! 
:) [08:43:52] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241870 (owner: 10Matthias Mullie) [08:43:53] (03Merged) 10jenkins-bot: IPoid: Retry on intermittent network errors in OpenSearch fetcher [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242253 (https://phabricator.wikimedia.org/T417908) (owner: 10Kosta Harlan) [08:43:58] yw :) [08:44:14] (03PS2) 10Arnaudb: Revert^2 "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [08:44:23] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1241870|Squashed diff to master]], [[gerrit:1241057|Lift IP cap for Editathon on commonswiki, eswiki (T417830)]], [[gerrit:1242253|IPoid: Retry on intermittent network errors in OpenSearch fetcher (T417908)]], [[gerrit:1240769|IPReputation: Lower IPoid request and connect timeouts (T417910)]] [08:44:31] T417830: Lift IP cap on 2026-02-25/27 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T417830 [08:44:31] T417908: IPReputation: Support multiple retries to improve resilience to timeouts - https://phabricator.wikimedia.org/T417908 [08:44:32] T417910: IPReputation: Lower the request and connect timeouts - https://phabricator.wikimedia.org/T417910 [08:44:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8121/console" [puppet] - 10https://gerrit.wikimedia.org/r/1242257 (owner: 10Majavah) [08:48:38] !log mszwarc@deploy2002 mlitn, kharlan, anzx, mszwarc: Backport for [[gerrit:1241870|Squashed diff to master]], [[gerrit:1241057|Lift IP cap for Editathon on commonswiki, eswiki (T417830)]], [[gerrit:1242253|IPoid: Retry on intermittent network errors in OpenSearch fetcher (T417908)]], 
[[gerrit:1240769|IPReputation: Lower IPoid request and connect timeouts (T417910)]] synced to the testservers (see https://wikitech.wikime [08:48:39] dia.org/wiki/Mwdebug). Changes can now be verified there. [08:49:09] matthiasmullie, kostajh: Is there anything to test for your patches? [08:49:47] I can do testing afterwards; no active code paths atm [08:49:52] Msz2001: no [08:49:57] Okay, continuing [08:50:05] !log mszwarc@deploy2002 mlitn, kharlan, anzx, mszwarc: Continuing with sync [08:50:59] (03PS1) 10Hashar: admin: hashar: disable fetch.prunetags [puppet] - 10https://gerrit.wikimedia.org/r/1242261 (https://phabricator.wikimedia.org/T418085) [08:52:39] (03PS1) 10Muehlenhoff: ssh: Remove support for buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1242264 [08:53:03] !log hashar@deploy2002 Started deploy [integration/docroot@1641910]: build: update misc dependencies 50ce133 11dba19 1641910 [08:53:16] !log hashar@deploy2002 Finished deploy [integration/docroot@1641910]: build: update misc dependencies 50ce133 11dba19 1641910 (duration: 00m 12s) [08:53:28] (03CR) 10CI reject: [V:04-1] ssh: Remove support for buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1242264 (owner: 10Muehlenhoff) [08:54:36] (03CR) 10Hashar: admin: hashar: sync .gitconfig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224648 (owner: 10Hashar) [08:54:43] (03PS2) 10Muehlenhoff: ssh: Remove support for buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1242264 [08:55:32] kostajh: FYI: sync-prod is at 60% [08:56:28] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1241870|Squashed diff to master]], [[gerrit:1241057|Lift IP cap for Editathon on commonswiki, eswiki (T417830)]], [[gerrit:1242253|IPoid: Retry on intermittent network errors in OpenSearch fetcher (T417908)]], [[gerrit:1240769|IPReputation: Lower IPoid request and connect timeouts (T417910)]] (duration: 12m 05s) [08:56:37] T417830: Lift IP cap on 2026-02-25/27 for 
Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T417830 [08:56:37] T417908: IPReputation: Support multiple retries to improve resilience to timeouts - https://phabricator.wikimedia.org/T417908 [08:56:38] T417910: IPReputation: Lower the request and connect timeouts - https://phabricator.wikimedia.org/T417910 [08:56:58] kostajh: You can go with your deployment [08:57:04] Msz2001: thanks [08:57:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242254 (https://phabricator.wikimedia.org/T412947) (owner: 10Kosta Harlan) [08:58:38] Msz2001: thanks for deploying, could you run the maintenance script? [08:58:47] I've already done so [08:58:53] ok thanks [08:59:02] yw [08:59:08] Msz2001: thanks! [09:05:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1240258 (https://phabricator.wikimedia.org/T417771) (owner: 10Majavah) [09:06:14] (03CR) 10Majavah: [V:03+1 C:03+2] puppetboard: Do not load fonts from external CDNs [puppet] - 10https://gerrit.wikimedia.org/r/1240258 (https://phabricator.wikimedia.org/T417771) (owner: 10Majavah) [09:06:24] (03CR) 10Muehlenhoff: [C:03+2] Run Gerrit spec tests on Bullseye/Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1240703 (owner: 10Muehlenhoff) [09:08:48] kostajh: do you mind pinging me when done? 
i'd like to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1240694 [09:09:46] (03Merged) 10jenkins-bot: HCaptchaEnterpriseHealthChecker: Use a cache hit for health check [extensions/ConfirmEdit] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242254 (https://phabricator.wikimedia.org/T412947) (owner: 10Kosta Harlan) [09:10:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1242254|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] [09:10:11] T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947 [09:10:56] (03CR) 10Muehlenhoff: "check" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 (owner: 10Muehlenhoff) [09:11:44] (03CR) 10Slyngshede: [C:03+1] "Makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/1240840 (owner: 10Muehlenhoff) [09:11:56] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1242254|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:13:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2004.codfw.wmnet with OS bookworm [09:13:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639288 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe2004.codfw.wmnet with OS bookworm [09:14:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and practically identical to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242264, left one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1242257 (owner: 10Majavah) [09:14:39] (03Abandoned) 10Muehlenhoff: ssh: Remove support for buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1242264 (owner: 10Muehlenhoff) [09:14:54] !log kharlan@deploy2002 kharlan: Continuing with sync [09:15:07] urbanecm: yes, will let you know [09:15:10] ty [09:15:17] (03PS2) 10Majavah: admin: Remove buster_ssh_keys logic [puppet] - 10https://gerrit.wikimedia.org/r/1242257 [09:15:35] (03CR) 10Majavah: admin: Remove buster_ssh_keys logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1242257 (owner: 10Majavah) [09:16:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1242257 (owner: 10Majavah) [09:17:34] (03CR) 10Majavah: [C:03+2] admin: Remove buster_ssh_keys logic [puppet] - 10https://gerrit.wikimedia.org/r/1242257 (owner: 10Majavah) [09:18:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242254|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] (duration: 08m 43s) [09:18:54] T412947: Reduce cache miss noise in memcached due to hcaptcha health checks - https://phabricator.wikimedia.org/T412947 [09:19:39] urbanecm: over to you [09:19:46] thanks! 
[09:20:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [09:20:59] (03Merged) 10jenkins-bot: [Growth] Force legacy validation of GrowthMentorList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [09:21:17] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1240694|[Growth] Force legacy validation of GrowthMentorList (T417422)]] [09:21:21] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [09:21:35] (03PS2) 10Arnaudb: gerrit: swap gerrit-spare and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) [09:21:35] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1242269/5862/" [puppet] - 10https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) (owner: 10Arnaudb) [09:22:07] (03PS1) 10Arnaudb: gerrit: swap gerrit-replica and gerrit-spare [dns] - 10https://gerrit.wikimedia.org/r/1242268 (https://phabricator.wikimedia.org/T417247) [09:23:16] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1240694|[Growth] Force legacy validation of GrowthMentorList (T417422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:26:59] causing `Fatal exception of type "InvalidArgumentException"` everywhere is probably not the brightest idea [09:29:07] (03PS1) 10Muehlenhoff: Run the cephosd spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1242271 [09:29:34] (03PS1) 10TrainBranchBot: Revert "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242273 [09:29:35] (03CR) 10TrainBranchBot: "urbanecm@deploy2002 created a revert of this change as I3b60f152ddabae595675e000bb9eaf3a297533d6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240694 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [09:30:02] (03PS2) 10Urbanecm: Revert "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242273 (https://phabricator.wikimedia.org/T417422) (owner: 10TrainBranchBot) [09:30:12] (03CR) 10Urbanecm: [C:03+2] Revert "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242273 (https://phabricator.wikimedia.org/T417422) (owner: 10TrainBranchBot) [09:31:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe2004.codfw.wmnet with reason: host reimage [09:31:48] (03Merged) 10jenkins-bot: Revert "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242273 (https://phabricator.wikimedia.org/T417422) (owner: 10TrainBranchBot) [09:35:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe2004.codfw.wmnet with reason: host reimage [09:36:30] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242273|Revert "[Growth] Force legacy validation of GrowthMentorList" (T417422)]] [09:36:33] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [09:42:30] !log urbanecm@deploy2002 
Finished scap sync-world: Backport for [[gerrit:1242273|Revert "[Growth] Force legacy validation of GrowthMentorList" (T417422)]] (duration: 06m 00s) [09:42:34] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [09:43:57] (03PS4) 10Arnaudb: gerrit: swap gerrit-spare and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) [09:43:57] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1242269/5863/" [puppet] - 10https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) (owner: 10Arnaudb) [09:44:33] (03PS3) 10Arnaudb: gerrit: disable service on gerrit2002 to reimage [puppet] - 10https://gerrit.wikimedia.org/r/1242272 (https://phabricator.wikimedia.org/T417247) [09:44:33] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1242272/5864/" [puppet] - 10https://gerrit.wikimedia.org/r/1242272 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [09:45:14] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098 (10mikez-WMF) 03NEW [09:46:35] (03CR) 10Blake: [C:03+1] alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components [puppet] - 10https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: 10A smart kitten) [09:51:07] (03PS1) 10Urbanecm: Revert^2 "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) [09:52:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11639579 (10ayounsi) a:03Papaul Thanks @Papaul We went with CWDM for the spine/leaf in rows C/D (T404103 and T396065#11047207), so it's best to do t... 
[09:53:14] (03PS1) 10Arnaudb: gerrit: prepare replication resume for gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) [09:53:14] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1242275/5865/" [puppet] - 10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [09:53:50] (03PS1) 10Arnaudb: gerrit: resume replication on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) [09:53:51] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1242279/5866/" [puppet] - 10https://gerrit.wikimedia.org/r/1242279 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [09:54:56] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [09:55:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [09:56:01] (03PS2) 10Urbanecm: Revert^2 "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) [09:56:44] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:57:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:57:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe2004.codfw.wmnet with OS bookworm [09:57:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe2004.codfw.wmnet with 
OS bookworm completed: - apu... [09:57:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2005.codfw.wmnet with OS bookworm [09:57:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639624 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe2005.codfw.wmnet with OS bookworm [10:02:10] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11639650 (10Volans) >>! In T330997#11635578, @Blake wrote: > I think I'd be inclined to prefer the more-defensi... [10:02:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1241012 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:03:16] (03PS2) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 [10:07:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:08:28] (03CR) 10Elukey: [C:03+2] profile::puppetserver: rework and fix the analytics-sre config [puppet] - 10https://gerrit.wikimedia.org/r/1241012 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:08:45] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1242255 (owner: 10Muehlenhoff) [10:09:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:15:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe2005.codfw.wmnet with reason: host reimage [10:16:55] (03PS1) 10Kamila Součková: conftool-data: add wikikube-worker2356 to test nokia switches [puppet] - 10https://gerrit.wikimedia.org/r/1242286 [10:19:00] (03CR) 10Blake: 
[C:03+1] conftool-data: add wikikube-worker2356 to test nokia switches [puppet] - 10https://gerrit.wikimedia.org/r/1242286 (owner: 10Kamila Součková) [10:19:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe2005.codfw.wmnet with reason: host reimage [10:19:24] (03CR) 10Kamila Součková: [C:03+2] conftool-data: add wikikube-worker2356 to test nokia switches [puppet] - 10https://gerrit.wikimedia.org/r/1242286 (owner: 10Kamila Součková) [10:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:21:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:22:13] (03CR) 10Ayounsi: "hostname lgtm, however I'm not familiar with conftool" [puppet] - 10https://gerrit.wikimedia.org/r/1242286 (owner: 10Kamila Součková) [10:23:04] (03PS3) 10JMeybohm: k8s-staging: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1240275 (https://phabricator.wikimedia.org/T352956) [10:23:04] (03PS1) 10JMeybohm: k8s-staging: Switch scheduler from wrr to mh [puppet] - 10https://gerrit.wikimedia.org/r/1242288 (https://phabricator.wikimedia.org/T352956) [10:23:06] (03PS1) 10JMeybohm: kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) [10:23:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:25:04] (03PS1) 10Muehlenhoff: Default to git protocol v2 fleet-wide [puppet] - 10https://gerrit.wikimedia.org/r/1242290 [10:25:10] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for maxbinderWMF - 
https://phabricator.wikimedia.org/T417655#11639742 (10KSiebert) @MatthewVernon Can you check what other permissions are required to let @MBinder_WMF see the dashboard? I think when I made the request, I also asked... [10:26:51] (03CR) 10Michael Große: [C:03+1] Revert^2 "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:28:06] (03PS1) 10Majavah: opesntack: encapi: Fix column name [puppet] - 10https://gerrit.wikimedia.org/r/1242291 (https://phabricator.wikimedia.org/T416588) [10:28:32] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240275 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [10:28:33] (03PS1) 10Muehlenhoff: Remove obsolete acct toil class [puppet] - 10https://gerrit.wikimedia.org/r/1242292 [10:30:34] (03CR) 10FNegri: [C:03+1] opesntack: encapi: Fix column name [puppet] - 10https://gerrit.wikimedia.org/r/1242291 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [10:30:54] (03PS1) 10Muehlenhoff: udev: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1242294 [10:32:30] (03PS1) 10Urbanecm: Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) [10:33:54] (03PS1) 10Urbanecm: Temporarily switch mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) [10:34:40] (03PS1) 10Muehlenhoff: Remove support for prometheus node exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/1242297 [10:35:26] (03CR) 10Urbanecm: [C:03+2] Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) 
(owner: 10Urbanecm) [10:35:31] (03CR) 10Urbanecm: [C:03+2] Temporarily switch mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:36:05] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2356.codfw.wmnet [10:36:58] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2356.codfw.wmnet [10:38:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11639783 (10MatthewVernon) Looking at [[ https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Groups | the access groups documentation ]], `analytics-priva... [10:39:02] (03CR) 10Urbanecm: Temporarily switch mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:39:06] (03CR) 10Urbanecm: Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:39:49] (03PS1) 10Urbanecm: cleanup: Remove unused code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242298 [10:39:59] (03PS2) 10Urbanecm: Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) [10:40:05] (03PS2) 10Urbanecm: Temporarily switch mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) [10:40:11] (03CR) 10CI reject: [V:04-1] Temporarily switch 
mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:40:48] (03CR) 10Urbanecm: [C:03+2] cleanup: Remove unused code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242298 (owner: 10Urbanecm) [10:40:52] (03CR) 10Urbanecm: [C:03+2] Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:40:56] (03CR) 10Urbanecm: [C:03+2] Temporarily switch mentor list validation to legacy validator [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:41:08] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [10:41:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [10:41:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe2005.codfw.wmnet with OS bookworm [10:42:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639786 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe2005.codfw.wmnet with OS bookworm completed: - apu... 
[10:42:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639787 (10MatthewVernon) [10:42:30] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2356.codfw.wmnet [10:42:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387#11639788 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Yep, setting preseed to expect UEFI booting fixed things. [10:43:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242294 (owner: 10Muehlenhoff) [10:43:23] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2356.codfw.wmnet [10:48:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242298 (owner: 10Urbanecm) [10:48:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:48:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:48:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242292 (owner: 10Muehlenhoff) [10:49:30] (03CR) 10Volans: "Thanks a lot Blake for starting this! I've left a main comment in the commit message and some questions inline in the code." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [10:50:11] (03CR) 10JMeybohm: [C:03+1] role::kafka::test: prepare the cluster for the Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1239142 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [10:50:41] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240888 (https://phabricator.wikimedia.org/T411769) (owner: 10Clément Goubert) [10:51:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242290 (owner: 10Muehlenhoff) [10:52:12] (03CR) 10Hashar: "Because the GitHub replication was no more configured due to the replication plugin issue? Then replications were not scheduled, which mea" [puppet] - 10https://gerrit.wikimedia.org/r/1238315 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [10:52:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:52:28] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:53:03] (03Merged) 10jenkins-bot: cleanup: Remove unused code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242298 (owner: 10Urbanecm) [10:55:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:55:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:56:39] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:56:45] (03Merged) 10jenkins-bot: Validate mentor list using a JSON schema [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242295 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:56:47] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:56:48] (03CR) 10CI reject: [V:04-1] Temporarily switch mentor list validation to legacy validator 
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:57:09] (03PS1) 10Muehlenhoff: Remove obsolete config override for git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/1242299 [10:57:16] (03CR) 10Urbanecm: [V:03+2 C:03+2] "flaky test" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242296 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [10:57:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [10:57:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [10:57:57] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242298|cleanup: Remove unused code]], [[gerrit:1242295|Validate mentor list using a JSON schema (T417422)]], [[gerrit:1242296|Temporarily switch mentor list validation to legacy validator (T417422)]] [10:58:01] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [10:58:38] !log jayme@cumin1003 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-codfw: trixie upgrade [10:59:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1242298|cleanup: Remove unused code]], [[gerrit:1242295|Validate mentor list using a JSON schema (T417422)]], [[gerrit:1242296|Temporarily switch mentor list validation to legacy validator (T417422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1100) [11:00:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242299 (owner: 10Muehlenhoff) [11:00:41] !log urbanecm@deploy2002 urbanecm: Continuing with sync [11:02:39] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS trixie [11:03:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:03:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:04:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:04:43] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242298|cleanup: Remove unused code]], [[gerrit:1242295|Validate mentor list using a JSON schema (T417422)]], [[gerrit:1242296|Temporarily switch mentor list validation to legacy validator (T417422)]] (duration: 06m 46s) [11:04:48] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [11:04:51] (03CR) 10JMeybohm: [C:03+1] service mesh: Add page-analytics listener [puppet] - 10https://gerrit.wikimedia.org/r/1240888 (https://phabricator.wikimedia.org/T411769) (owner: 10Clément Goubert) [11:05:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:07:35] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1242315 
(https://phabricator.wikimedia.org/T414656) [11:08:43] (03PS1) 10Marostegui: dbproxy1029: Remove line [puppet] - 10https://gerrit.wikimedia.org/r/1242347 [11:08:55] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1242315 (https://phabricator.wikimedia.org/T414656) (owner: 10Marostegui) [11:08:59] !log marostegui@dns1006 START - running authdns-update [11:09:14] !log Failover m5-master T414656 [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] T414656: Migrate dbproxy* to Debian Trixie - https://phabricator.wikimedia.org/T414656 [11:09:40] (03CR) 10Marostegui: [C:03+2] dbproxy1029: Remove line [puppet] - 10https://gerrit.wikimedia.org/r/1242347 (owner: 10Marostegui) [11:09:55] (03CR) 10Marostegui: [C:03+2] "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1242347 (owner: 10Marostegui) [11:10:18] (03PS13) 10Blake: locking: Add a mechanism for a global Spicerack lock. [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) [11:10:19] !log marostegui@dns1006 END - running authdns-update [11:11:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242297 (owner: 10Muehlenhoff) [11:12:21] (03CR) 10Blake: locking: Add a mechanism for a global Spicerack lock. 
(037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [11:13:32] (03PS1) 10Kamila Součková: admin/common-bgp: add F4 ToR switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242351 [11:13:57] (03PS3) 10Dreamy Jazz: Drop $wgIPReputationEnableLoginCaptchaIfIPKnown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240227 (https://phabricator.wikimedia.org/T416941) [11:14:46] jouncebot: nowandnext [11:14:46] For the next 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1100) [11:14:46] In 2 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1400) [11:15:01] (03PS2) 10Kamila Součková: admin/common-bgp: add F4 ToR switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242351 (https://phabricator.wikimedia.org/T417817) [11:15:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240227 (https://phabricator.wikimedia.org/T416941) (owner: 10Dreamy Jazz) [11:15:41] (03CR) 10Dreamy Jazz: Drop $wgIPReputationEnableLoginCaptchaIfIPKnown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240227 (https://phabricator.wikimedia.org/T416941) (owner: 10Dreamy Jazz) [11:15:45] (03CR) 10Dreamy Jazz: [C:03+2] Drop $wgIPReputationEnableLoginCaptchaIfIPKnown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240227 (https://phabricator.wikimedia.org/T416941) (owner: 10Dreamy Jazz) [11:17:33] (03PS1) 10Clément Goubert: envoy: Allow immediate draining in drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242353 (https://phabricator.wikimedia.org/T364245) [11:17:57] (03Merged) 10jenkins-bot: Drop $wgIPReputationEnableLoginCaptchaIfIPKnown [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1240227 (https://phabricator.wikimedia.org/T416941) (owner: 10Dreamy Jazz) [11:20:46] (03PS1) 10Clément Goubert: mw-debug: Immediately drain envoy on termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242354 (https://phabricator.wikimedia.org/T364245) [11:21:53] !log start reef 18.2.7 upgrade of eqiad apus storage nodes T417396 [11:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:58] T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396 [11:22:57] (03CR) 10Ayounsi: [C:03+2] nftables: define NETWORK_INFRA [puppet] - 10https://gerrit.wikimedia.org/r/1240931 (owner: 10Ayounsi) [11:23:06] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [11:25:29] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1240227|Drop $wgIPReputationEnableLoginCaptchaIfIPKnown (T416941)]] [11:25:33] T416941: IPReputation: Remove IPReputationEnableLoginCaptchaIfIPKnown and associated logic - https://phabricator.wikimedia.org/T416941 [11:26:40] (03CR) 10Ayounsi: [C:03+1] admin/common-bgp: add F4 ToR switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242351 (https://phabricator.wikimedia.org/T417817) (owner: 10Kamila Součková) [11:27:14] (03CR) 10Kamila Součková: [C:03+2] admin/common-bgp: add F4 ToR switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242351 (https://phabricator.wikimedia.org/T417817) (owner: 10Kamila Součková) [11:27:44] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1240227|Drop $wgIPReputationEnableLoginCaptchaIfIPKnown (T416941)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:28:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [11:28:36] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:32:19] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11640007 (10Rsilvola) [11:32:27] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240227|Drop $wgIPReputationEnableLoginCaptchaIfIPKnown (T416941)]] (duration: 06m 58s) [11:32:31] T416941: IPReputation: Remove IPReputationEnableLoginCaptchaIfIPKnown and associated logic - https://phabricator.wikimedia.org/T416941 [11:32:56] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11640023 (10Rsilvola) 05Open→03Declined Hello @Dzahn, Much of this was already filled in [T4... 
[11:34:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1011: Depooling pc1 [11:34:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [11:35:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:35:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1011: Depooling pc1 [11:35:48] (03Merged) 10jenkins-bot: admin/common-bgp: add F4 ToR switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242351 (https://phabricator.wikimedia.org/T417817) (owner: 10Kamila Součková) [11:35:53] (03PS1) 10Marostegui: pc1011,pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1242358 (https://phabricator.wikimedia.org/T417626) [11:36:07] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Upgrade to debian trixie [11:36:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Upgrade to debian trixie [11:37:05] (03CR) 10Marostegui: [C:03+2] pc1011,pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1242358 (https://phabricator.wikimedia.org/T417626) (owner: 10Marostegui) [11:37:11] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:37:43] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:38:18] (03PS4) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 [11:38:59] (03PS5) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax and convert to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/954287 [11:39:16] (03PS5) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) [11:39:53] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1011.eqiad.wmnet with OS trixie [11:40:03] (03PS6) 10Clément Goubert: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) [11:40:27] (03Abandoned) 10Muehlenhoff: Default to git protocol v2 fleet-wide [puppet] - 10https://gerrit.wikimedia.org/r/1242290 (owner: 10Muehlenhoff) [11:41:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [11:47:36] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) (owner: 10Clément Goubert) [11:48:34] !log start reef 18.2.7 upgrade of eqiad apus frontends T417396 [11:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:38] T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396 [11:48:52] (03CR) 10Ayounsi: [C:03+1] pmacct: Avoid Ferm-specific syntax and convert to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [11:49:17] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS trixie [11:49:23] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control 
planes of cluster staging-codfw: trixie upgrade [11:50:16] (03Merged) 10jenkins-bot: api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) (owner: 10Clément Goubert) [11:50:38] (03PS1) 10Muehlenhoff: fastnetmon: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242361 [11:52:08] (03PS1) 10Muehlenhoff: samplicator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242362 [11:52:40] (03CR) 10CI reject: [V:04-1] samplicator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242362 (owner: 10Muehlenhoff) [11:53:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage [11:53:24] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:53:44] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:53:54] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2356.codfw.wmnet [11:54:46] !log kamila@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2356.codfw.wmnet [11:56:37] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:56:49] !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=ptwiki --logwiki=metawiki 'Bianca Fernandes Dias' Greenlighrts # T418113 [11:56:57] T418113: Unblock stuck global rename of Greenlighrts - https://phabricator.wikimedia.org/T418113 [11:57:03] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396#11640088 (10MatthewVernon) eqiad cluster done. 
[11:57:06] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:58:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: host reimage [12:00:03] (03CR) 10Muehlenhoff: [C:03+2] pmacct: Avoid Ferm-specific syntax and convert to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [12:00:21] jouncebot: nowandnext [12:00:21] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [12:00:21] In 1 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1400) [12:01:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:01:30] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:01:36] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733#11640125 (10Aklapper) a:05Ottomata→03None @Ottomata Removing task assignee as this open task has been assigned for more than two year... 
[12:01:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242361 (owner: 10Muehlenhoff) [12:03:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:04:03] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:04:10] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-Needs-Improvement: Audit cloud filters on CR in respect of new cloud-private and public VIP networks - https://phabricator.wikimedia.org/T347030#11640137 (10Aklapper) a:05cmooney→03None @cmooney Removing task assignee as this open task has been ass... [12:04:16] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-Needs-Improvement: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133#11640139 (10Aklapper) a:05cmooney→03None @cmooney Removing task assignee as this open task has been assigned for more tha... [12:04:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:04:25] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10netbox, 10observability: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272#11640141 (10Aklapper) a:05cmooney→03None @cmooney Removing task assignee as this open task has been assigned for more... 
[12:04:43] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:05:19] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:06:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [12:06:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:06:51] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [12:06:52] (03PS2) 10Muehlenhoff: samplicator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242362 [12:06:59] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:08:27] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:16:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242362 (owner: 10Muehlenhoff) [12:17:03] (03CR) 10Ayounsi: [C:03+1] fastnetmon: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242361 (owner: 10Muehlenhoff) [12:17:53] (03PS3) 10Urbanecm: Revert^2 "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) [12:18:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:18:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:18:33] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host pc1011.eqiad.wmnet with OS trixie [12:19:44] (03CR) 10JMeybohm: [C:03+1] envoy: Allow immediate draining in drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242353 (https://phabricator.wikimedia.org/T364245) (owner: 10Clément Goubert) [12:19:50] (03CR) 10JMeybohm: [C:03+1] mw-debug: Immediately drain envoy on termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242354 (https://phabricator.wikimedia.org/T364245) (owner: 10Clément Goubert) [12:19:52] (03PS1) 10Ayounsi: Add BGP neighbors IPs for codfw E/F racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242366 (https://phabricator.wikimedia.org/T417817) [12:20:00] (03Merged) 10jenkins-bot: Revert^2 "[Growth] Force legacy validation of GrowthMentorList" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242283 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:20:16] (03CR) 10Muehlenhoff: [C:03+2] fastnetmon: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242361 (owner: 10Muehlenhoff) [12:20:18] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242283|Revert^2 "[Growth] Force legacy validation of GrowthMentorList" (T417422)]] [12:20:22] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [12:20:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc2011.codfw.wmnet with OS trixie [12:22:11] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1242283|Revert^2 "[Growth] Force legacy validation of GrowthMentorList" (T417422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:22:48] (03CR) 10Ayounsi: [C:03+1] samplicator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242362 (owner: 10Muehlenhoff) [12:23:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:24:02] (03CR) 10Muehlenhoff: [C:03+2] samplicator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242362 (owner: 10Muehlenhoff) [12:24:08] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:24:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:24:37] !log jayme@cumin1003 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-codfw: trixie upgrade [12:25:08] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:26:28] (03CR) 10Kamila Součková: [C:03+1] Add BGP neighbors IPs for codfw E/F racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242366 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:28:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:28:19] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster2004.codfw.wmnet with OS trixie [12:29:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:29:59] jouncebot: nowandnext [12:30:00] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [12:30:00] In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1400) [12:30:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:30:52] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:30:58] (03PS1) 10Muehlenhoff: Switch netflow7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242371 [12:30:58] (03PS1) 10Muehlenhoff: Switch the netinsights role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242372 [12:31:09] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:32:10] !log urbanecm@deploy2002 urbanecm: Continuing with sync [12:32:21] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:32:44] (03CR) 10Clément Goubert: [V:03+2 C:03+2] envoy: Allow immediate draining in drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242353 (https://phabricator.wikimedia.org/T364245) (owner: 10Clément Goubert) [12:32:51] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:33:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242371 (owner: 10Muehlenhoff) [12:33:22] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:33:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:33:41] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [12:34:01] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [12:34:03] !log Rebuilding envoy image - T364245 [12:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:08] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245 [12:34:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:34:12] !log kamila@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [12:34:37] !log kamila@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [12:35:09] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:35:22] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[12:35:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:35:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:36:09] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242283|Revert^2 "[Growth] Force legacy validation of GrowthMentorList" (T417422)]] (duration: 15m 50s) [12:36:11] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:36:13] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [12:36:27] (03CR) 10Clément Goubert: [C:03+2] service mesh: Add page-analytics listener [puppet] - 10https://gerrit.wikimedia.org/r/1240888 (https://phabricator.wikimedia.org/T411769) (owner: 10Clément Goubert) [12:37:12] (03PS1) 10Ayounsi: Add BGP neighbors IPs for eqiad C/D racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242380 (https://phabricator.wikimedia.org/T417817) [12:37:20] (03PS2) 10Muehlenhoff: puppetdb: Drop firewall rule for access to Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1239647 (https://phabricator.wikimedia.org/T365798) [12:37:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2011.codfw.wmnet with reason: host reimage [12:38:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1239647 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:39:32] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:39:53] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1014.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [12:40:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:40:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:41:31] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:42:08] (03CR) 10Ayounsi: [C:03+1] Switch netflow7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242371 (owner: 10Muehlenhoff) [12:42:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2011.codfw.wmnet with reason: host reimage [12:43:42] jouncebot: nowandnext [12:43:42] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [12:43:42] In 1 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1400) [12:43:48] (03PS3) 10Dreamy Jazz: Filter for suppressed usernames [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242370 (https://phabricator.wikimedia.org/T417868) [12:44:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242370 (https://phabricator.wikimedia.org/T417868) (owner: 10Dreamy Jazz) [12:45:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11640366 (10Jclark-ctr) @elukey i have tried a few times unable to get these to provision Last message before failure is below. i have DM you passwords from labels ` di... 
[12:45:29] (03CR) 10Clément Goubert: wikifeeds: Add request definition for page analytics (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [12:46:16] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [12:48:00] (03CR) 10Filippo Giunchedi: [C:03+1] udev: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1242294 (owner: 10Muehlenhoff) [12:48:17] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [12:48:30] !log start reef 18.2.7 upgrade of codfw apus storage nodes T417396 [12:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:34] T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396 [12:49:33] (03CR) 10Kamila Součková: [C:03+2] Add BGP neighbors IPs for codfw E/F racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242366 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:50:28] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:53:53] (03CR) 10Kamila Součková: [C:03+2] Add BGP neighbors IPs for eqiad C/D racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242380 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:54:18] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [12:54:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1240804 
(https://phabricator.wikimedia.org/T411404) (owner: 10Kamila Součková) [12:55:02] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [12:55:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:55:46] (03CR) 10Kamila Součková: [C:03+2] admin: update ssh keys for kamila [puppet] - 10https://gerrit.wikimedia.org/r/1240804 (https://phabricator.wikimedia.org/T411404) (owner: 10Kamila Součková) [12:56:53] (03Merged) 10jenkins-bot: Filter for suppressed usernames [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242370 (https://phabricator.wikimedia.org/T417868) (owner: 10Dreamy Jazz) [12:57:15] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1242370|Filter for suppressed usernames (T417868)]] [12:57:20] (03Merged) 10jenkins-bot: Add BGP neighbors IPs for codfw E/F racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242366 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:57:40] (03CR) 10EarlyWarningBot: "[Failed command](https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/15182/consoleFull): `composer r" [extensions/CheckUser] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242370 (https://phabricator.wikimedia.org/T417868) (owner: 10Dreamy Jazz) [12:58:38] 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11640400 (10Raine) 05Open→03Resolved a:03Raine Merged, thanks! 
[12:58:51] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1016-20 - jclark@cumin1003" [12:58:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1016-20 - jclark@cumin1003" [12:58:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:09] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1242370|Filter for suppressed usernames (T417868)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:00:27] FIRING: [5x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:02:02] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:02:17] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:02:22] (03PS1) 10Slyngshede: Inform about gitlab profile updating quirks [software/bitu] - 10https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) [13:02:25] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:02:33] (03Merged) 10jenkins-bot: Add BGP neighbors IPs for eqiad C/D racks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242380 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [13:02:33] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED [13:02:37] (03PS3) 10Urbanecm: [Growth] beta: Enable new GrowthMentorList validation on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) [13:02:48] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:02:56] (03CR) 10Muehlenhoff: [C:03+2] Switch netflow7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242371 (owner: 10Muehlenhoff) [13:04:32] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:05:18] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:05:28] FIRING: [6x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:05:33] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11640417 (10Clement_Goubert) >>! In T330997#11639650, @Volans wrote: >>>! In T330997#11635578, @Blake wrote: >>... [13:05:54] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242370|Filter for suppressed usernames (T417868)]] (duration: 08m 39s) [13:06:04] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:06:04] I'm done with deploys [13:06:20] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[13:06:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2011.codfw.wmnet with OS trixie [13:06:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [13:07:09] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:07:39] (03Merged) 10jenkins-bot: [Growth] beta: Enable new GrowthMentorList validation on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240697 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [13:08:32] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:09:17] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:10:27] FIRING: [7x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:10:34] !log kamila@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [13:11:26] !log kamila@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:12:05] !log kamila@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:13:20] (03PS4) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) [13:13:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:32] (03PS1) 10Urbanecm: [Growth] Log read failures when JSON schema validation is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242392 (https://phabricator.wikimedia.org/T417422) [13:13:36] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:13:46] !log kamila@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7002.magru.wmnet [13:14:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:15:27] FIRING: [7x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:16:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2004.codfw.wmnet with OS trixie [13:16:36] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:17:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7002.magru.wmnet [13:20:00] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS trixie [13:20:27] FIRING: [7x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:22:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:23:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:04] !log start reef 18.2.7 upgrade of codfw apus frontends T417396 [13:24:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:07] T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396 [13:24:12] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1096.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:13] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:25] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:25:27] FIRING: [7x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:25:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11640479 (10Jclark-ctr) [13:27:09] (03PS3) 10Muehlenhoff: Obsolete airflow-analytics-product-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1240336 [13:28:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:29:22] jclark@cumin1003 provision (PID 358812) is awaiting input [13:30:27] FIRING: [6x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2010:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:31:39] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:31:39] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:34:11] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#11640503 (10Tgr) Or the output stream gets misconfigured... [13:37:49] jclark@cumin1003 provision (PID 358807) is awaiting input [13:37:53] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [13:38:03] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8122/co" [puppet] - 10https://gerrit.wikimedia.org/r/1242288 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [13:38:45] (03CR) 10Vgutierrez: [V:03+1 C:03+1] k8s-staging: Switch scheduler from wrr to mh [puppet] - 10https://gerrit.wikimedia.org/r/1242288 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [13:43:17] FIRING: [3x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:46] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet 
with reason: host reimage [13:44:50] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Upgrade apus' ceph to 18.2.7 (or .8 if already available) - https://phabricator.wikimedia.org/T417396#11640540 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon codfw cluster done, too. [13:45:15] jclark@cumin1003 provision (PID 358812) is awaiting input [13:45:21] (03CR) 10Vgutierrez: [C:03+1] k8s-staging: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1240275 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [13:45:41] (03CR) 10Vgutierrez: [C:03+1] kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [13:46:12] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11640558 (10Ladsgroup) >>! In T414805#11637548, @Tacsipacsi wrote: >>>! In T414805#11636273, @Ladsgroup wrote: >> The rate limit is a... [13:48:17] FIRING: [6x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:13] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:50:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1096.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:50:22] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:50:55] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:51:47] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:52:14] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1237246 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:52:22] !log kamila@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:53:17] FIRING: [8x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:23] !log kamila@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:53:32] jclark@cumin1003 provision (PID 358812) is awaiting input [13:54:23] !log kamila@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:55:08] (03PS1) 10Muehlenhoff: mariadb:packages: Remove spec test [puppet] - 10https://gerrit.wikimedia.org/r/1242398 [13:55:09] !log kamila@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:56:09] (03CR) 10Marostegui: [C:03+1] mariadb:packages: Remove spec test [puppet] - 10https://gerrit.wikimedia.org/r/1242398 (owner: 10Muehlenhoff) [13:57:38] !log kamila@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:57:43] (03Abandoned) 10Arnaudb: gerrit: add mtail monitoring on replication [puppet] - 10https://gerrit.wikimedia.org/r/1238315 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [13:58:10] !log kamila@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:58:25] !log kamila@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:59:20] (03PS1) 10Arnaudb: gerrit: alert for broken replication [alerts] - 10https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1400). [14:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:23] o/ I can’t deploy 😔 [14:00:28] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:00:35] !log kamila@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:01:17] hi [14:01:36] !log added Calico BGPPeers for ToR switches in all k8s clusters [14:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] FIRING: [14x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:30] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2005.codfw.wmnet with OS trixie [14:05:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster staging-codfw: trixie upgrade [14:06:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1097.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:07:49] I can deploy [14:08:17] FIRING: [8x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:38] 06SRE, 
10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11640623 (10lerickson) 05Open→03Resolved Oh, I am closing this, because I found the documentation for getting a Kerberos identity: https://wikitech.wikimedia.org/wiki... [14:11:58] (03CR) 10Elukey: [C:03+1] Remove obsolete config override for git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/1242299 (owner: 10Muehlenhoff) [14:12:17] (03CR) 10Elukey: [C:03+1] Obsolete airflow-analytics-product-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1240336 (owner: 10Muehlenhoff) [14:13:17] FIRING: [16x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:10] !log cleaning useless rows of bot_passwords (T417977) [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:14] (03CR) 10A smart kitten: "scheduled for [puppet request deployment window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260224T1700) @ 17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: 10A smart kitten) [14:17:52] (03CR) 10Muehlenhoff: [C:03+2] Obsolete airflow-analytics-product-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1240336 (owner: 10Muehlenhoff) [14:17:53] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136 (10phaultfinder) 03NEW [14:18:17] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:21:34] (03PS1) 10Ssingh: sre.dns.admin: improve cookbook task ID and action_string [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 [14:21:44] (03CR) 10Bking: "Good question, and the answer is yes, the application owners are part of Data Engineering. We don't necessarily need their approval, but w" [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [14:24:24] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11640665 (10Jclark-ctr) [14:26:54] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11640673 (10Volans) >>! In T330997#11640417, @Clement_Goubert wrote: > Not a fan of negative names, I think `ch... 
[14:27:01] (03CR) 10Muehlenhoff: [C:03+2] mariadb:packages: Remove spec test [puppet] - 10https://gerrit.wikimedia.org/r/1242398 (owner: 10Muehlenhoff) [14:27:14] (03CR) 10Ssingh: "dry-run output looks good:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 (owner: 10Ssingh) [14:29:18] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418062#11640676 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:29:30] (03CR) 10Fabfur: [C:03+1] sre.dns.admin: improve cookbook task ID and action_string [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 (owner: 10Ssingh) [14:29:37] (03CR) 10Slyngshede: [C:03+1] "Look good." [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 (owner: 10Ssingh) [14:30:19] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: improve cookbook task ID and action_string [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 (owner: 10Ssingh) [14:32:18] (03PS1) 10Muehlenhoff: Obsolete airflow-search-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1242407 [14:33:17] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:24] (03Merged) 10jenkins-bot: sre.dns.admin: improve cookbook task ID and action_string [cookbooks] - 10https://gerrit.wikimedia.org/r/1242405 (owner: 10Ssingh) [14:38:17] (03CR) 10Arnaudb: gerrit: add gerrit-replica service to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417897) (owner: 10Arnaudb) [14:38:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:29] !log dummy sre.dns.admin run [14:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:34] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: no reason specified, no task ID specified] [14:38:39] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool esams [reason: no reason specified, no task ID specified] [14:40:15] !log UTC afternoon deploy window over (skipped) [14:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:30] (03PS1) 10Elukey: sre.hosts.provision: add a better message when no NICS are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 [14:40:31] (03CR) 10Vgutierrez: [C:03+1] "please unify hiera values across DCs in a following commit" [puppet] - 10https://gerrit.wikimedia.org/r/1237247 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:41:19] (03CR) 10Arnaudb: "- for the DNS record: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wmnet#1045" [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417897) (owner: 10Arnaudb) [14:42:07] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1237247 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:42:19] (03PS1) 10Ayounsi: Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) [14:43:55] (03CR) 10CI reject: [V:04-1] Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [14:45:28] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: add a better message when no NICS 
are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 (owner: 10Elukey) [14:45:31] !log installing busybox updates from Bookworm point release [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:34] (03PS2) 10Ayounsi: Nokia: add local-as to k8s BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1242410 (https://phabricator.wikimedia.org/T417817) [14:47:32] (03PS2) 10Elukey: sre.hosts.provision: add a better message when no NICS are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 [14:47:41] (03PS1) 10Cwhite: admin: add key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1242411 [14:47:50] (03CR) 10Elukey: [C:03+2] docker_registry: route /v2/test prefix to s3/apus [puppet] - 10https://gerrit.wikimedia.org/r/1239164 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [14:52:21] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: add a better message when no NICS are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 (owner: 10Elukey) [14:54:32] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-serve-codfw: maintenance [14:55:06] (03PS1) 10Fabfur: cache::upload: cleanup hiera [puppet] - 10https://gerrit.wikimedia.org/r/1242414 (https://phabricator.wikimedia.org/T406545) [14:55:15] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-serve-codfw: maintenance [14:55:43] (03PS4) 10Arnaudb: gerrit: add gerrit-replica service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) [14:55:53] (03PS4) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417897) [14:56:05] (03PS5) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 
(https://phabricator.wikimedia.org/T417998) [14:58:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242414 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:59:11] (03PS5) 10Arnaudb: gerrit: add gerrit-replica service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) [14:59:21] (03PS6) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) [14:59:28] (03PS7) 10Arnaudb: gerrit: add gerrit-replica backend to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) [14:59:49] !log dpogorzelski@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=recommendation-api,name=codfw [15:04:16] (03PS1) 10Arnaudb: gerrit: prevent NodeTextfileStale alert on nft throttling [alerts] - 10https://gerrit.wikimedia.org/r/1242413 (https://phabricator.wikimedia.org/T418139) [15:04:58] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:02] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:07] (03CR) 10Vgutierrez: [C:03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/1242414 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:05:58] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:06:02] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:06:02] RECOVERY - OSPF 
status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:06:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:06:22] (03PS1) 10Dpogorzelski: ml-serve-codfw: k8s deps update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242415 (https://phabricator.wikimedia.org/T414485) [15:07:24] !log jayme@deploy1003 conftool action : gfet; selector: dc=eqiad,cluster=kubesvc [15:07:42] (03CR) 10Ayounsi: [C:03+1] Switch the netinsights role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1242372 (owner: 10Muehlenhoff) [15:08:10] (03PS1) 10Itamar Givon: Add configurations for graphql usage survey and its pipleine tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [15:08:17] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:01] (03CR) 10CI reject: [V:04-1] Add configurations for graphql usage survey and its pipleine tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [15:09:12] (03PS2) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [15:09:20] (03PS1) 10Dpogorzelski: ml-serve-codfw: update k8s [puppet] - 10https://gerrit.wikimedia.org/r/1242417 (https://phabricator.wikimedia.org/T414485) [15:09:45] 
!log jayme@deploy1003 conftool action : set/weight=10; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc [15:09:53] !log jayme@deploy1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc [15:10:04] (03CR) 10CI reject: [V:04-1] Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [15:11:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:12:16] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster ml-serve-codfw: Kubernetes upgrade [15:12:54] (03PS3) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [15:12:58] (03CR) 10JMeybohm: [C:03+2] k8s-staging: Switch scheduler from wrr to mh [puppet] - 10https://gerrit.wikimedia.org/r/1242288 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:13:17] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:06] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11640839 (10ABran-WMF) 05Open→03Stalled marking stalled because blocked by {T286066} [15:14:31] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: 
wikikube-staging-worker-eqiad@eqiad [15:14:58] (03CR) 10CI reject: [V:04-1] ml-serve-codfw: k8s deps update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242415 (https://phabricator.wikimedia.org/T414485) (owner: 10Dpogorzelski) [15:15:22] (03CR) 10Fabfur: [C:03+2] cache::upload: cleanup hiera [puppet] - 10https://gerrit.wikimedia.org/r/1242414 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:15:56] dpogorzelski@cumin1003 wipe-cluster (PID 373786) is awaiting input [15:18:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:24] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:20:10] (03PS1) 10JMeybohm: loadbalaner.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) [15:20:16] (03CR) 10Dpogorzelski: [C:03+2] ml-serve-codfw: update k8s [puppet] - 10https://gerrit.wikimedia.org/r/1242417 (https://phabricator.wikimedia.org/T414485) (owner: 10Dpogorzelski) [15:20:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:20:22] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-staging-worker-eqiad@eqiad [15:20:27] (03PS2) 10JMeybohm: loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) [15:20:40] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl2002.codfw.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve2008.codfw.wmnet, ml-serve2010.codfw.wmnet, ml-serve2005.codfw.wmnet, ml-serve2006.codfw.wmnet, ml-serve2009.codfw.wmnet, ml-serve2011.codfw.wmnet are marked down but pooled: inference_30443: Servers ml-serve2002.codfw.wmnet, ml [15:20:40] 10.codfw.wmnet, ml-serve2003.codfw.wmnet, ml-serve2006.codfw.wmnet, ml-serve2009.codfw.wmnet, ml-serve2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:20:44] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2002.codfw.wmnet, ml-serve2003.codfw.wmnet, ml-serve2005.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2001.codfw.wmnet, ml-serve2011.codfw.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve2008.codfw.wmnet, ml-serve2003.codfw.wmnet, ml-serve2005.codfw.wmnet, ml-serve2006.codfw.wmnet, ml-s [15:20:44] .codfw.wmnet, ml-serve2011.codfw.wmnet are marked down but pooled: ml-ctrl_6443: Servers ml-serve-ctrl2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:21:37] (03PS4) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 [15:22:40] dpogorzelski@cumin1003 wipe-cluster (PID 373786) is awaiting input [15:23:11] (03CR) 10Ssingh: "Looks good to me! Deferring to @bcornwall@wikimedia.org and @slyngshede@wikimedia.org for their review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [15:23:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:19] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-staging-worker-codfw@codfw [15:23:30] !log jayme@cumin1003 END (ERROR) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=97) for alias: wikikube-staging-worker-codfw@codfw [15:23:32] (03PS5) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 [15:23:52] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-staging-worker-codfw@codfw [15:24:09] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8123/console" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [15:24:45] (03CR) 10CDobbins: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (owner: 10CDobbins) [15:25:00] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [15:25:56] (03CR) 10CI reject: [V:04-1] loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:28:31] (03PS2) 10Dpogorzelski: ml-serve-codfw: k8s deps update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242415 (https://phabricator.wikimedia.org/T414485) [15:28:52] (03PS3) 10Dpogorzelski: ml-serve-codfw: k8s deps update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242415 
(https://phabricator.wikimedia.org/T414485) [15:29:09] 06SRE, 06Infrastructure-Foundations, 10Mail: Remove mail alias/fork from dmarc-rua@wikimedia.org to dmarc@donate.wikimedia.org - https://phabricator.wikimedia.org/T417941#11640965 (10jhathaway) p:05Triage→03Medium [15:29:14] (03PS3) 10JMeybohm: loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) [15:29:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11640966 (10ayounsi) p:05Triage→03Medium [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1530) [15:33:50] jayme@cumin1003 migrate-service-ipip (PID 376592) is awaiting input [15:34:15] (03CR) 10CI reject: [V:04-1] loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [15:37:16] (03CR) 10Dpogorzelski: [C:03+2] ml-serve-codfw: k8s deps update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242415 (https://phabricator.wikimedia.org/T414485) (owner: 10Dpogorzelski) [15:37:25] 06SRE, 06Infrastructure-Foundations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495#11640991 (10CDanis) p:05Medium→03Low [15:37:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:37:49] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [15:37:49] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: 
wikikube-staging-worker-codfw@codfw [15:38:53] (03PS4) 10JMeybohm: loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) [15:39:11] 06SRE, 06Infrastructure-Foundations, 07Security: Access requests process: Consideration of 'indirect' sudo rules via e.g. keyholder - https://phabricator.wikimedia.org/T207739#11640993 (10LSobanski) 05Open→03Declined [15:40:26] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve2005.codfw.wmnet, ml-serve2006.codfw.wmnet, ml-serve2009.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2001.codfw.wmnet, ml-serve2002.codfw.wmnet are marked down but pooled: inference_30443: Servers ml-serve2003.codfw.wmnet, ml [15:40:26] 06.codfw.wmnet, ml-serve2009.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2001.codfw.wmnet, ml-serve2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:43:30] (03CR) 10Elukey: locking: Add a mechanism for a global Spicerack lock. 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [15:44:25] !log jayme@cumin1003 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-eqiad: trixie upgrade [15:45:36] dpogorzelski: ^^ looks like there is an issue with the ml cluster [15:46:49] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242421 (https://phabricator.wikimedia.org/T128546) [15:47:49] (03CR) 10Tiziano Fogli: [C:03+2] slothslos: add module to build and deploy sloth manifests [puppet] - 10https://gerrit.wikimedia.org/r/1239166 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:48:09] (03CR) 10Tiziano Fogli: [C:03+2] thanos::rule: add ExecReload to the service unit [puppet] - 10https://gerrit.wikimedia.org/r/1239906 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [15:48:11] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster1003.eqiad.wmnet with OS trixie [15:50:23] dpogorzelski@cumin1003 wipe-cluster (PID 373786) is awaiting input [15:50:44] 06SRE, 06Infrastructure-Foundations, 10Keyholder: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003#11641074 (10elukey) @thcipriani Hi! We are reviewing the backlog and we are wondering if you are still interested in pursuing this. 
[15:51:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::b6f9:5dff:fe30:cd38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:53:45] (03CR) 10Joal: [C:03+1] Fix for new banner activity dimension in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240762 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [15:56:05] 06SRE, 06Infrastructure-Foundations, 10Mail: Exim panics when spamd reaches maxchildren - https://phabricator.wikimedia.org/T166291#11641137 (10jhathaway) 05Open→03Declined We only run rspamd in combination with postfix now [15:56:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::b6f9:5dff:fe30:cd38 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:56:21] (03PS2) 10Jdrewniak: Updating portals submodule for Wikipedia 25 birthday [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242421 (https://phabricator.wikimedia.org/T128546) [15:56:52] (03Abandoned) 10Jdrewniak: Wikipedia portal 25th birthday update. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238699 (https://phabricator.wikimedia.org/T416015) (owner: 10Jdrewniak) [15:57:07] (03PS2) 10Ejegg: Fix for new banner activity dimension in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240762 (https://phabricator.wikimedia.org/T414478) [15:57:35] (03CR) 10Gehel: [C:03+2] Fix for new banner activity dimension in Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/1240762 (https://phabricator.wikimedia.org/T414478) (owner: 10Ejegg) [15:58:17] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T418027#11641150 (10Jhancock.wm) →14Duplicate dup:03T416726 [15:58:19] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726#11641152 (10Jhancock.wm) [15:58:42] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#11641155 (10JMeybohm) p:05Medium→03High Rising priority for parity with {T390861} [16:00:31] (03PS1) 10Ssingh: wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) [16:00:36] (03PS1) 10Kosta Harlan: IPInfo: Grant ipinfo-view-arbitrary-ip to checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242424 (https://phabricator.wikimedia.org/T374718) [16:01:25] (03CR) 10CI reject: [V:04-1] wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:02:48] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage [16:04:19] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-22-ulsfo [16:04:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-22-ulsfo 
[16:04:58] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [16:05:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [16:05:17] (03CR) 10Ssingh: "Will add PTRs and open for review" [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:07:06] 06SRE, 06Infrastructure-Foundations: Improve management of users/groups on servers in production - https://phabricator.wikimedia.org/T235161#11641201 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is done [16:08:14] 06SRE, 06Infrastructure-Foundations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394#11641211 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This happened with a very old slapd version and we haven't seen it since. [16:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1003.eqiad.wmnet with reason: host reimage [16:09:26] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:09:35] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:10:21] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:10:35] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:10:45] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:10:57] 06SRE, 06Infrastructure-Foundations, 10observability: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147#11641228 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This was mostly an issue with the automated restart (profile::auto_restarts::service) on older Debia... [16:11:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:12:56] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:13:20] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: puppetmasters: update the puppet masters so they use them self for the puppet run - https://phabricator.wikimedia.org/T238093#11641262 (10MoritzMuehlenhoff) 05Open→03Declined No longer relevant with Puppet 7 [16:14:03] (03PS1) 10Fabfur: hiera: test for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 [16:14:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (owner: 10Fabfur) [16:14:44] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:15:29] (03PS1) 10Daniel Kinzler: rest-gateway: disable external_services for monikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) [16:16:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:16:45] (03PS1) 10Papaul: Add new Nokia switches for initial homer run [homer/public] - 10https://gerrit.wikimedia.org/r/1242429 (https://phabricator.wikimedia.org/T408511) [16:16:45] (03CR) 10Dzahn: [C:03+1] gerrit: prevent NodeTextfileStale alert on nft throttling [alerts] - 10https://gerrit.wikimedia.org/r/1242413 (https://phabricator.wikimedia.org/T418139) (owner: 10Arnaudb) [16:17:54] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [16:18:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [16:18:47] (03PS2) 10Daniel Kinzler: rest-gateway: disable external_services for minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) [16:19:09] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:19:15] (03PS4) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) [16:19:23] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: disable external_services for minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) (owner: 10Daniel Kinzler) [16:19:43] (03PS6) 10Daniel Kinzler: rest-gateway: remove support for insecure user ID cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) [16:19:55] (03CR) 10CI reject: [V:04-1] cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:22:11] (03PS5) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) [16:22:15] (03CR) 10Papaul: [C:03+2] Add new Nokia switches for initial homer run [homer/public] - 10https://gerrit.wikimedia.org/r/1242429 (https://phabricator.wikimedia.org/T408511) (owner: 10Papaul) [16:25:41] (03PS6) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) [16:26:06] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:26:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11641309 (10MoritzMuehlenhoff) [16:27:42] (03PS2) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240690 (https://phabricator.wikimedia.org/T417717) [16:27:42] (03PS2) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) [16:28:15] (03PS1) 10Muehlenhoff: Migrate swift-rsync to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242430 [16:28:54] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:28:59] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:29:09] (03PS3) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) [16:29:17] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:29:20] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:29:25] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1003.eqiad.wmnet with OS trixie [16:29:29] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:29:31] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:29:47] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:29:51] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:29:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11641319 (10Jhancock.wm) whoops [16:30:01] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1630). [16:30:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11641320 (10Jhancock.wm) [16:30:19] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:30:25] (03PS3) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240690 (https://phabricator.wikimedia.org/T417717) [16:30:27] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:30:30] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:31:31] (03PS1) 10Muehlenhoff: Enable Bird 2.18 for cloudservices/eqiad1 and cloudlb/eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1242431 (https://phabricator.wikimedia.org/T413740) [16:31:40] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:32:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:32:58] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster1004.eqiad.wmnet with OS trixie [16:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:57] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11641345 (10Jhancock.wm) [16:35:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [16:36:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [16:36:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:11] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2021 [16:36:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2021 [16:36:36] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:36:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242421 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:36:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:37:21] (03PS7) 10Eevans: cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) [16:37:44] (03Merged) 10jenkins-bot: Updating portals submodule for Wikipedia 25 birthday [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242421 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:38:03] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1242421|Updating portals submodule for Wikipedia 25 birthday (T128546)]] [16:38:08] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:38:56] (03PS1) 10Muehlenhoff: PHP: Run spec tests on Bullseye and Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1242433 [16:39:08] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:40:11] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1242421|Updating portals submodule for Wikipedia 25 birthday (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:40:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242430 (owner: 10Muehlenhoff) [16:40:43] jhancock@cumin2002 provision (PID 2025714) is awaiting input [16:41:04] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [16:42:36] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11641395 (10Papaul) User homer password set on both switches and sre.network.tls.cookbook failed on asw1-23-ulsfo. first homer run on asw1-22-ulsfo is giving the... [16:43:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [16:43:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [16:44:08] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.2.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240690 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [16:44:38] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.2.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [16:45:03] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242421|Updating portals submodule for Wikipedia 25 birthday (T128546)]] (duration: 06m 59s) [16:45:07] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:45:12] 
jhancock@cumin2002 provision (PID 2025714) is awaiting input [16:46:25] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240690 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [16:46:38] (03PS1) 10Eevans: cassandra-dev2001: enable Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1242435 (https://phabricator.wikimedia.org/T418010) [16:47:15] (03PS3) 10Daniel Kinzler: rest-gateway: use MINUTE limits in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239669 [16:47:23] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242435 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:47:37] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage [16:48:17] RESOLVED: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:47] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:50:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:50:33] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:54:03] (03CR) 10Eevans: [C:03+2] cassandra: enable use of Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1241042 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:54:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1004.eqiad.wmnet with reason: host reimage [16:55:34] (03PS2) 10Ssingh: wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) [16:56:09] jhancock@cumin2002 netbox (PID 2031412) is awaiting input [16:56:23] (03CR) 10CI reject: [V:04-1] wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:57:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [16:57:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [16:57:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:56] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:58:03] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[16:59:12] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: enable Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1242435 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [16:59:26] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:59:44] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [17:01:07] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:01:26] (03PS3) 10Daniel Kinzler: rest-gateway: improve readability of tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239972 [17:01:36] (03CR) 10Daniel Kinzler: rest-gateway: improve readability of tests (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239972 (owner: 10Daniel Kinzler) [17:02:29] (03CR) 10Daniel Kinzler: [C:04-1] python tests: use type hints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239529 (owner: 10Daniel Kinzler) [17:02:51] (03PS1) 10Santiago Faci: test-kitchen kubernetes chart: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) [17:03:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [17:03:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2021 to codfw - jhancock@cumin2002" [17:03:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:38] (03PS1) 10Dpogorzelski: kserve: fix dependency on cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242439 [17:06:33] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[17:06:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:07:11] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [17:08:29] dpogorzelski@cumin1003 wipe-cluster (PID 373786) is awaiting input [17:08:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:09:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:09:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:10:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:00] 06SRE, 06Security-Team, 06Traffic, 05FY2025-26 WE 4.6 - Account Security, and 2 others: High volume of suspicious CSP reports for itwiki - https://phabricator.wikimedia.org/T414014#11641554 (10sbassett) [17:14:37] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1004.eqiad.wmnet with OS trixie [17:16:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:18:11] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kubestagemaster1005.eqiad.wmnet with OS trixie [17:19:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frqueue2004 to codfw - jhancock@cumin2002" [17:20:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frqueue2004 
to codfw - jhancock@cumin2002" [17:20:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:26] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [17:21:09] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [17:21:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [17:22:29] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [17:22:46] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [17:22:58] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [17:23:16] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [17:23:28] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [17:23:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:23:42] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [17:23:49] (03CR) 10Elukey: [C:03+1] kserve: fix dependency on cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242439 (owner: 10Dpogorzelski) [17:23:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[17:24:06] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [17:24:16] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [17:24:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T415786)', diff saved to https://phabricator.wikimedia.org/P88983 and previous config saved to /var/cache/conftool/dbconfig/20260223-172421-marostegui.json [17:24:26] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:24:28] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [17:24:44] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11641604 (10Jhancock.wm) [17:24:47] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [17:24:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host frqueue2004 [17:25:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [17:25:11] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [17:25:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host frqueue2004 [17:25:21] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [17:25:33] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[17:27:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2021.codfw.wmnet with OS bullseye [17:27:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11641608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2021.codfw.wmnet with OS bullseye [17:28:12] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster ml-serve-codfw: Kubernetes upgrade [17:29:54] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11641611 (10herron) [17:30:05] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11641612 (10herron) [17:30:48] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/ml-serve-codfw: maintenance [17:31:55] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/ml-serve-codfw: maintenance [17:32:39] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage [17:35:03] !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api,name=codfw [17:35:33] 10SRE-SLO, 10Observability-Alerting, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11641641 (10herron) 05Open→03Resolved a:03herron [17:35:40] (03PS2) 10Santiago Faci: test-kitchen kubernetes chart: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) [17:35:54] (03PS3) 10Santiago Faci: test-kitchen kubernetes chart: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) [17:36:24] 
(03PS1) 10Eevans: cassandra.in.sh: port Java 17'isms from 5.0 branch [puppet] - 10https://gerrit.wikimedia.org/r/1242447 (https://phabricator.wikimedia.org/T418010) [17:36:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:18] (03CR) 10Eevans: [C:03+2] cassandra.in.sh: port Java 17'isms from 5.0 branch [puppet] - 10https://gerrit.wikimedia.org/r/1242447 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [17:38:27] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1005.eqiad.wmnet with reason: host reimage [17:39:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P88985 and previous config saved to /var/cache/conftool/dbconfig/20260223-173930-marostegui.json [17:42:26] (03PS1) 10Elukey: profile::httpbb::docker-registry: improve tests [puppet] - 10https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576) [17:44:29] (03CR) 10Elukey: profile::httpbb::docker-registry: improve tests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1242452 (https://phabricator.wikimedia.org/T414576) (owner: 10Elukey) [17:47:02] 10SRE-SLO, 06Abstract Wikipedia team, 06serviceops, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160 (10Jdforrester-WMF) 03NEW [17:49:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:54:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to 
https://phabricator.wikimedia.org/P88986 and previous config saved to /var/cache/conftool/dbconfig/20260223-175438-marostegui.json [17:57:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2021.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:58:04] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [17:58:52] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1800) [18:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T1800). [18:00:19] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1005.eqiad.wmnet with OS trixie [18:00:25] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster staging-eqiad: trixie upgrade [18:01:47] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [18:02:19] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device asw1-23-ulsfo [18:03:12] (03PS6) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) [18:05:50] (03Abandoned) 10Ssingh: DNS IPv6 anycast: change router config to support new ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1238015 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [18:05:57] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11641801 (10Jhancock.wm) 
@Dwisehaupt @Jgreen this is ready for y'all. [18:07:44] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11641807 (10herron) [18:08:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2021.codfw.wmnet with reason: host reimage [18:09:15] (03CR) 10Jforrester: [C:03+1] Update documenation to reference config-schema.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241167 (owner: 10Zabe) [18:09:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T415786)', diff saved to https://phabricator.wikimedia.org/P88987 and previous config saved to /var/cache/conftool/dbconfig/20260223-180947-marostegui.json [18:09:52] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:10:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2247.codfw.wmnet with reason: Maintenance [18:10:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T415786)', diff saved to https://phabricator.wikimedia.org/P88988 and previous config saved to /var/cache/conftool/dbconfig/20260223-181011-marostegui.json [18:10:39] FIRING: CoreBGPDown: Core BGP session down between cr2-esams and cr2-eqdfw (208.80.153.217) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2021.codfw.wmnet with reason: host reimage [18:15:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-esams and cr2-eqdfw (208.80.153.217) - group Confed_codfw - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:16:40] 10SRE-SLO: Sloth: migrate existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163 (10herron) 03NEW [18:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:24:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11641861 (10RobH) a:05Papaul→03RobH [18:31:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:32:11] (03PS3) 10Ssingh: wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) [18:33:03] (03CR) 10CI reject: [V:04-1] wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:33:08] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [18:34:14] jhancock@cumin2002 reimage (PID 2051568) is awaiting input [18:35:01] (03PS4) 10Ssingh: wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) 
[18:35:35] (03PS7) 10CDobbins: codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) [18:37:07] (03CR) 10Ssingh: "@ayounsi@wikimedia.org: can you please double-check the ns02 v6 PTRs here? Thank you." [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:38:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:22] (03PS5) 10Ssingh: wikimedia.org: add IPv6 glue records for ns0 and ns2 [dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) [18:42:17] 06SRE, 06Infrastructure-Foundations, 10Keyholder: Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003#11641936 (10thcipriani) >>! In T203003#11641074, @elukey wrote: > @thcipriani Hi! We are reviewing the backlog and we are wondering if you are still interested in pursuing this. T... [18:53:11] (03CR) 10Ssingh: "OK. I think we can start with that and then review from Traffic's end. I am not trying to save work here to be clear; the idea is that you" [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [18:54:58] (03CR) 10Ssingh: "Thanks for cleaning it up! 
There may be some more work on this as part of the CSP rollout so I will review when that is finalized and in c" [puppet] - 10https://gerrit.wikimedia.org/r/1240799 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:01:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:01:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2021.codfw.wmnet with OS bullseye [19:01:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11641984 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2021.codfw.wmnet with OS bullseye completed: - ms-fe2021 (**PAS... [19:04:02] (03CR) 10BCornwall: [C:03+1] codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:06:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:06:49] (03CR) 10BBlack: [C:03+1] "SGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/1242423 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:07:20] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, 06Security-Team: Request membership in wmf-deployment group for Rsilvola - https://phabricator.wikimedia.org/T418004#11642002 (10Rsilvola) 05Declined→03Open (reopening) [19:07:49] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196686 (owner: 10PipelineBot) [19:07:54] (03PS1) 10Eevans: cassandra: argument typo [puppet] - 10https://gerrit.wikimedia.org/r/1242461 (https://phabricator.wikimedia.org/T418010) [19:08:36] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240783 (owner: 10PipelineBot) [19:09:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2022 to codfw - jhancock@cumin2002" [19:09:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2022 to codfw - jhancock@cumin2002" [19:09:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:11:10] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240783 (owner: 10PipelineBot) [19:13:40] (03PS1) 10Scott French: envoy: Allow inboundonly drain and support min wait time [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) [19:13:40] (03CR) 10Scott French: [V:03+2] "Built and verified against local envoy test setup." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:13:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2022 to codfw - jhancock@cumin2002" [19:13:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2022 to codfw - jhancock@cumin2002" [19:13:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2022 [19:14:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2022 [19:15:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:21:04] jhancock@cumin2002 provision (PID 2105873) is awaiting input [19:22:30] (03CR) 10Eevans: [C:03+2] cassandra: argument typo [puppet] - 10https://gerrit.wikimedia.org/r/1242461 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [19:25:42] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [19:26:11] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [19:32:04] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [19:32:14] (03CR) 10BCornwall: [C:03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1242169 (https://phabricator.wikimedia.org/T418080) (owner: 10Gerrit maintenance bot) [19:33:51] (03PS1) 10Urbanecm: [Growth] Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242466 (https://phabricator.wikimedia.org/T417422) 
[19:34:11] (03CR) 10BCornwall: "Yep!" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:34:48] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [19:34:54] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [19:35:35] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [19:36:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:38:03] (03PS1) 10Dzahn: gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) [19:38:35] (03CR) 10CI reject: [V:04-1] gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:39:09] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8125/c" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [19:42:10] (03PS2) 10Dzahn: gerrit: remove code for having multiple daemon users [puppet] - 10https://gerrit.wikimedia.org/r/1242467 (https://phabricator.wikimedia.org/T338470) [19:42:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2022.codfw.wmnet with OS bullseye [19:42:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11642175 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2022.codfw.wmnet with OS bullseye [19:48:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision 
for host ms-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:50:57] (03PS1) 10Bartosz Dziewoński: UpdateAutomaticGlobalGroupMembership: Read user data from primary [extensions/CentralAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242469 (https://phabricator.wikimedia.org/T416541) [19:55:56] (03CR) 10Dzahn: [C:03+2] "The class seems to also still be used by deployment_server and parsoid-test. lgtm. noop." [puppet] - 10https://gerrit.wikimedia.org/r/1242433 (owner: 10Muehlenhoff) [19:56:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2022.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:56:49] (03CR) 10Dzahn: ProdPasteBot: Call paste.edit instead of deprecated paste.create (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [20:00:06] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8127/co" [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [20:00:39] (03CR) 10Dzahn: [C:03+1] ProdPasteBot: Call paste.edit instead of deprecated paste.create [puppet] - 10https://gerrit.wikimedia.org/r/1241027 (https://phabricator.wikimedia.org/T410572) (owner: 10Aklapper) [20:01:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1240778 (https://phabricator.wikimedia.org/T417824) (owner: 10Dzahn) [20:02:08] (03CR) 10Dzahn: [C:03+1] gerrit: disable service on gerrit2002 to reimage [puppet] - 10https://gerrit.wikimedia.org/r/1242272 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [20:03:14] (03CR) 10Dzahn: [C:03+1] gerrit: prepare replication resume for gerrit2002 [puppet] - 
10https://gerrit.wikimedia.org/r/1242275 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [20:03:56] (03CR) 10Dzahn: [C:03+1] gerrit: alert for broken replication [alerts] - 10https://gerrit.wikimedia.org/r/1242399 (https://phabricator.wikimedia.org/T418084) (owner: 10Arnaudb) [20:05:04] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11642282 (10herron) [20:06:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2022.codfw.wmnet with reason: host reimage [20:07:24] (03CR) 10Dzahn: [C:03+1] gerrit: add gerrit-replica backend to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [20:08:41] (03CR) 10Dzahn: gerrit: add gerrit-replica service to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [20:08:46] (03PS1) 10Jdrewniak: Change "Learn more" link underneath Baby Globe on Minerva [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242471 (https://phabricator.wikimedia.org/T417077) [20:10:11] (03CR) 10Dzahn: [C:03+1] "lgtm. 
and then hopefully the last time we have to flip like this and only the discovery name in the future" [dns] - 10https://gerrit.wikimedia.org/r/1242268 (https://phabricator.wikimedia.org/T417247) (owner: 10Arnaudb) [20:11:08] (03PS1) 10Eevans: cassandra: Java 8 no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) [20:11:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242471 (https://phabricator.wikimedia.org/T417077) (owner: 10Jdrewniak) [20:11:54] (03CR) 10Dzahn: gerrit: swap gerrit-spare and gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1242269 (https://phabricator.wikimedia.org/T406334) (owner: 10Arnaudb) [20:12:36] (03PS1) 10Jdrewniak: i18n: Update community configuration copy [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242474 (https://phabricator.wikimedia.org/T415346) [20:12:42] (03PS2) 10Eevans: cassandra: Java 8 no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) [20:12:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242474 (https://phabricator.wikimedia.org/T415346) (owner: 10Jdrewniak) [20:13:06] (03CR) 10Dzahn: [C:03+2] miscweb: add release for status.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [20:13:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2022.codfw.wmnet with reason: host reimage [20:14:09] 
(03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [20:14:58] (03Merged) 10jenkins-bot: miscweb: add release for status.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [20:16:12] Hello!! Not sure who to tag, but I've got some backports I have to run in 45 minutes and in the interest of time I was wondering if there were any objections to me getting started early (now)? It looks like there's nothing happening currently on the schedule [20:20:02] thcipriani: apologies if I'm bugging you - would you be the right person to ask the above? [20:20:09] jouncebot: nowandnext [20:20:09] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [20:20:09] In 0 hour(s) and 39 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T2100) [20:20:55] toyofuku: calendar is empty, should be fine [20:20:58] toyofuku: it seems quiet and nothing else on the calendar. seems ok [20:21:15] Okay, thank you both!! [20:21:21] I might also deploy some stuff afterwards [20:22:21] (03CR) 10BCornwall: [V:03+1 C:03+2] codfw: add the following cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1240784 (https://phabricator.wikimedia.org/T418161) (owner: 10CDobbins) [20:22:21] toyofuku: ^ right channel to check it, looks like folks got you what you needed. Happy deploying! [20:22:43] Thank you and sorry to bother you all!! Deploying now ☺️ [20:24:02] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11642331 (10herron) [20:24:07] Doing a 3 patch backport, so double triple checking to make sure I'm deploying the right things [20:27:05] thank you to jan_drewniak for prepping all the cherry picks behind my back - let's begin! 
[20:27:17] Who knows how tagging in irc works lol [20:27:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241026 (https://phabricator.wikimedia.org/T415355) (owner: 10Bernard Wang) [20:27:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242471 (https://phabricator.wikimedia.org/T417077) (owner: 10Jdrewniak) [20:27:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242474 (https://phabricator.wikimedia.org/T415346) (owner: 10Jdrewniak) [20:28:23] (03PS1) 10Eevans: Revert "cassandra-dev2001: enable Java 17" [puppet] - 10https://gerrit.wikimedia.org/r/1242480 [20:28:38] Been a while since I dropped music recs in this channel - I'm currently listening to THE GOAT by Polo G [20:30:25] (03CR) 10Eevans: [C:03+2] Revert "cassandra-dev2001: enable Java 17" [puppet] - 10https://gerrit.wikimedia.org/r/1242480 (owner: 10Eevans) [20:36:23] (03Merged) 10jenkins-bot: Migrate default user preference configuration to Community Configuration [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1241026 (https://phabricator.wikimedia.org/T415355) (owner: 10Bernard Wang) [20:36:25] (03Merged) 10jenkins-bot: Change "Learn more" link underneath Baby Globe on Minerva [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242471 (https://phabricator.wikimedia.org/T417077) (owner: 10Jdrewniak) [20:36:37] (03Merged) 10jenkins-bot: i18n: Update community configuration copy [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242474 (https://phabricator.wikimedia.org/T415346) (owner: 10Jdrewniak) [20:37:01] !log toyofuku@deploy2002 
Started scap sync-world: Backport for [[gerrit:1241026|Migrate default user preference configuration to Community Configuration (T415355)]], [[gerrit:1242471|Change "Learn more" link underneath Baby Globe on Minerva (T417077)]], [[gerrit:1242474|i18n: Update community configuration copy (T415346)]] [20:37:08] T415355: Migrate default user preference configuration to Community Configuration - https://phabricator.wikimedia.org/T415355 [20:37:09] T417077: Add link to settings below Baby Globe on Minerva - https://phabricator.wikimedia.org/T417077 [20:37:09] T415346: Enhance CommunityConfiguration UI for Birthday Mode - https://phabricator.wikimedia.org/T415346 [20:37:30] (03PS1) 10Dzahn: releases: upgrade Java version from 17 to 21 [puppet] - 10https://gerrit.wikimedia.org/r/1242483 [20:38:14] We've got translation updates so this will take a sec [20:38:28] ooh, maybe not too long 🤞 [20:38:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:41:46] jhancock@cumin2002 reimage (PID 2119528) is awaiting input [20:46:22] jk it is indeed taking a long time [20:48:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:48:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2022.codfw.wmnet with OS bullseye [20:48:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11642412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2022.codfw.wmnet with OS bullseye completed: - ms-fe2022 (**WAR... 
[20:52:25] (03PS1) 10DLynch: Edit check: catch various places where an error could derail things [extensions/VisualEditor] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242493 (https://phabricator.wikimedia.org/T406836) [20:52:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242493 (https://phabricator.wikimedia.org/T406836) (owner: 10DLynch) [20:55:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:56:49] phew I was getting scared [20:59:18] good thing you started early, I guess. That build and push took a bit. [20:59:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T415786)', diff saved to https://phabricator.wikimedia.org/P88990 and previous config saved to /var/cache/conftool/dbconfig/20260223-205921-marostegui.json [20:59:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2023 to codfw - jhancock@cumin2002" [20:59:26] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:59:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2023 to codfw - jhancock@cumin2002" [20:59:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:59:53] Yeah, apologies we're doing a chunky deploy for the baby globe project - hopefully the last of its kind [21:00:02] We're almost at testservers [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T2100). 
[21:00:05] toyofuku, bd808, danisztls, and Kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] o/ [21:00:09] ...it's so weird seeing an X-is-typing notification on an IRC server. [21:00:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2023 [21:00:29] o/ I have an UBN-fix to backport, but I can handle it myself. [21:00:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2023 [21:00:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2023 [21:00:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2023 [21:00:50] Sounds good - we're mid-deploy currently, but hopefully done sooner rather than later since I started half an hour ago [21:00:56] 😬 [21:01:04] !log toyofuku@deploy2002 bwang, jdrewniak, toyofuku: Backport for [[gerrit:1241026|Migrate default user preference configuration to Community Configuration (T415355)]], [[gerrit:1242471|Change "Learn more" link underneath Baby Globe on Minerva (T417077)]], [[gerrit:1242474|i18n: Update community configuration copy (T415346)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:01:11] T415355: Migrate default user preference configuration to Community Configuration - https://phabricator.wikimedia.org/T415355 [21:01:11] T417077: Add link to settings below Baby Globe on Minerva - https://phabricator.wikimedia.org/T417077 [21:01:12] T415346: Enhance CommunityConfiguration UI for Birthday Mode - https://phabricator.wikimedia.org/T415346 [21:01:24] yay okay testing now [21:02:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:03:05] o/ I switched channels a bit late. sorry [21:04:00] My config change is a functional no-op that should be safe to bundle with anything else. [21:04:03] Sorry, three patches to test [21:04:08] Two more minutes or so [21:05:13] Alright, looks good [21:05:14] yolo [21:05:18] !log toyofuku@deploy2002 bwang, jdrewniak, toyofuku: Continuing with sync [21:07:07] jhancock@cumin2002 provision (PID 2159996) is awaiting input [21:07:15] toyofuku: tagging on IRC works like this. if a line starts with "nickname:" then it usually makes IRC clients show lines in a different color or makes a sound depending on user config [21:07:49] I know that one, but sometimes people tag in the middle of the sentence and I haven't figured out that one lol [21:07:57] Like how is bwang tagged above [21:09:08] oh I see, yea I think the answer is entirely "depends on their config". many people have a "highlight" for their own name even if in the middle of a line, but not everyone. the start of the line is more universal [21:09:26] irc is just text [21:09:43] you can say someone's name and they will either be notified or not [21:09:58] sometimes you will see people use something like "toyo.fuku" to actively avoid pinging someone while talking about them. that practice is also debated. [21:09:59] (welcome to 1980) [21:10:03] I have to drive somewhere really quick.
If this finishes up, and someone wants to roll my backport in, that's fine. Otherwise I'll be back for it soon. [21:10:39] Kemayo: no need to test it? [21:11:28] For some reason (I suspect the localization cache?) this deploy is taking 5ever [21:11:37] But we're making progress [21:11:56] Once we're done I have to attend to the fact that I don't have any heat currently...? [21:12:06] NYC blizzy [21:12:41] 🥶 [21:13:44] ♨️ [21:14:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P88991 and previous config saved to /var/cache/conftool/dbconfig/20260223-211430-marostegui.json [21:15:12] tgr_: I can test it if that comes up, I just can’t actually run the spiderpig deploy for another 30 minutes or so. [21:15:37] I can deploy it [21:15:50] I think all the other deploys are noops and can go together? [21:16:04] bd808: danisztls: ^ [21:16:39] tgr_: Mine sould be safe to bundle with pretty anything [21:16:43] *should [21:17:14] Kemayo: what needs to be deployed? [21:17:59] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1241026|Migrate default user preference configuration to Community Configuration (T415355)]], [[gerrit:1242471|Change "Learn more" link underneath Baby Globe on Minerva (T417077)]], [[gerrit:1242474|i18n: Update community configuration copy (T415346)]] (duration: 40m 58s) [21:18:06] T415355: Migrate default user preference configuration to Community Configuration - https://phabricator.wikimedia.org/T415355 [21:18:06] T417077: Add link to settings below Baby Globe on Minerva - https://phabricator.wikimedia.org/T417077 [21:18:07] T415346: Enhance CommunityConfiguration UI for Birthday Mode - https://phabricator.wikimedia.org/T415346 [21:18:16] tgr_: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1242493 [21:18:21] And the world's longest deploy is now over! 
[21:18:23] Duration 50m 22s, wow [21:18:25] Thanks for waiting all [21:18:31] Yes where is my gold medal [21:19:02] a localization cache rebuild takes that long to compile and deploy [21:19:13] i think you removed it from the window accidentally, fixed: https://wikitech.wikimedia.org/w/index.php?diff=2383499 [21:19:21] tgr_: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1242493 [21:19:48] thanks, yeah, I might have edited over that [21:20:02] tgr_: it's basically a noop [21:20:12] Huh, yeah, looks like dueling edits wiped it out. [21:20:31] jhancock@cumin2002 provision (PID 2159996) is awaiting input [21:21:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240832 (https://phabricator.wikimedia.org/T411516) (owner: 10BryanDavis) [21:21:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:21:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:21:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242493 (https://phabricator.wikimedia.org/T406836) (owner: 10DLynch) [21:21:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242469 (https://phabricator.wikimedia.org/T416541) (owner: 10Bartosz Dziewoński) [21:22:23] (03Merged) 10jenkins-bot: Revert "extension-list: add a bogus extension to test l10n-update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240832 
(https://phabricator.wikimedia.org/T411516) (owner: 10BryanDavis) [21:22:27] (03Merged) 10jenkins-bot: Pre-deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241712 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:22:30] (03Merged) 10jenkins-bot: Pre-deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1241713 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:23:31] (03Merged) 10jenkins-bot: Edit check: catch various places where an error could derail things [extensions/VisualEditor] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242493 (https://phabricator.wikimedia.org/T406836) (owner: 10DLynch) [21:27:03] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11642586 (10ShakespeareFan00) This change to standardised size has also broken the "Preview Pagelist" functionality for editing Index... 
[21:27:15] (03Merged) 10jenkins-bot: UpdateAutomaticGlobalGroupMembership: Read user data from primary [extensions/CentralAuth] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1242469 (https://phabricator.wikimedia.org/T416541) (owner: 10Bartosz Dziewoński) [21:27:39] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1240832|Revert "extension-list: add a bogus extension to test l10n-update" (T411516)]], [[gerrit:1241712|Pre-deploy Comparative Reader Research survey on enwiki (T417829)]], [[gerrit:1241713|Pre-deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1242493|Edit check: catch various places where an error could derail things (T406836 T418 [21:27:39] 173)]], [[gerrit:1242469|UpdateAutomaticGlobalGroupMembership: Read user data from primary (T416541)]] [21:27:50] T411516: Add ability to ignore missing extensions in mergeMessageFileList's `--list-file` input - https://phabricator.wikimedia.org/T411516 [21:27:51] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:27:51] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:27:52] T406836: The Edit Check's SLO has burned all its error budget - https://phabricator.wikimedia.org/T406836 [21:27:52] T418: Search by dependency (blocked by or blocking) - https://phabricator.wikimedia.org/T418 [21:27:52] T416541: Automatic global group membership is updated on unrelated local group changes - https://phabricator.wikimedia.org/T416541 [21:28:28] (03PS1) 10Btullis: Record the fact that a kerberos principal has been created for asanford [puppet] - 10https://gerrit.wikimedia.org/r/1242498 (https://phabricator.wikimedia.org/T417447) [21:28:40] oh uh, another l10n update [21:29:05] 🙀 [21:29:38] (03CR) 10Btullis: [C:03+2] Record the fact that a kerberos principal has been created for asanford [puppet] - 10https://gerrit.wikimedia.org/r/1242498 
(https://phabricator.wikimedia.org/T417447) (owner: 10Btullis) [21:29:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P88992 and previous config saved to /var/cache/conftool/dbconfig/20260223-212938-marostegui.json [21:29:45] because of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1240832/ I guess [21:29:54] so this too might take a while [21:30:34] I see "0 languages rebuilt out of 545" in the log [21:31:58] !log tgr@deploy2002 matmarex, bd808, dani, tgr, kemayo: Backport for [[gerrit:1240832|Revert "extension-list: add a bogus extension to test l10n-update" (T411516)]], [[gerrit:1241712|Pre-deploy Comparative Reader Research survey on enwiki (T417829)]], [[gerrit:1241713|Pre-deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1242493|Edit check: catch various places where an error could derail things (T [21:31:58] 406836 T418173)]], [[gerrit:1242469|UpdateAutomaticGlobalGroupMembership: Read user data from primary (T416541)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:32:05] T418173: LiftWing edit-check:predict model is 404ing - https://phabricator.wikimedia.org/T418173 [21:32:25] the slow part seems to be the container update after l10n changes [21:32:31] but yeah in this case it wasn't [21:32:37] Kemayo: can you test? [21:33:26] tgr_: Just a second to do that. [21:35:00] tgr_: Okay, it looks good. [21:35:13] (03PS1) 10Cwhite: ncredir: add wikimediastatus.net funnel [puppet] - 10https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T414098) [21:35:30] !log tgr@deploy2002 matmarex, bd808, dani, tgr, kemayo: Continuing with sync [21:35:55] tgr_: Thanks for running it! [21:35:58] The slow thing is a full rebuild of the container rather than just a new layer with the rsync delta from the prior build. 
This is often triggered by a l10nupdate run causing a large delta in the cdb files. [21:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:33] cdb deltas are kind of random in that the cdb format is a binary where things can move around because of hashing functions [21:38:28] bd808: I found that most of the changes are a result of pointers to strings changing. Delete one character from a string and all the offsets to strings after it change. [21:38:28] so basically it just depends on whether you pass some threshold of number of bytes changed from the previous build? [21:39:02] tgr: If more than 25% of image data changes (compared to the prior image), a full image is built. [21:39:26] tgr_: thanks [21:40:31] dancy: ack. a pointer change cascade makes sense [21:41:34] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240832|Revert "extension-list: add a bogus extension to test l10n-update" (T411516)]], [[gerrit:1241712|Pre-deploy Comparative Reader Research survey on enwiki (T417829)]], [[gerrit:1241713|Pre-deploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1242493|Edit check: catch various places where an error could derail things (T406836 T41 [21:41:34] 8173)]], [[gerrit:1242469|UpdateAutomaticGlobalGroupMembership: Read user data from primary (T416541)]] (duration: 13m 56s) [21:41:45] T411516: Add ability to ignore missing extensions in mergeMessageFileList's `--list-file` input - https://phabricator.wikimedia.org/T411516 [21:41:46] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:41:46] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:41:46] T406836: The 
Edit Check's SLO has burned all its error budget - https://phabricator.wikimedia.org/T406836 [21:41:47] T416541: Automatic global group membership is updated on unrelated local group changes - https://phabricator.wikimedia.org/T416541 [21:42:03] !log running foreachwikiindblist sul CentralAuth:UpdateAutomaticGlobalGroupMembership --local-group=checkuser --local-group=suppress [21:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:07] I accidentally nerd sniped bvibber into working on alternate localization cache stuff in late 2025. I think she got quite deep into a new system, but I'm not sure that it has been tested in Beta Cluster yet. [21:44:16] i know we've tried building the updated cache sets for it, dunno if we've actually run the client in beta yet. who was poking at that... timo maybe? [21:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T415786)', diff saved to https://phabricator.wikimedia.org/P88993 and previous config saved to /var/cache/conftool/dbconfig/20260223-214447-marostegui.json [21:44:52] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:45:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: Maintenance [21:45:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T415786)', diff saved to https://phabricator.wikimedia.org/P88994 and previous config saved to /var/cache/conftool/dbconfig/20260223-214512-marostegui.json [21:45:55] yeah, probably Krinkle. He has a good perspective on how past attempts at moving away from CDB went sideways. Not having MediaWiki on Kubernetes in Beta Cluster probably makes some testing harder too. [21:46:15] *nod* [21:47:06] yeah the first attempt to use the php opcode cache for localization caching massively spiked PHP memory usage. 
my rework was specifically to reduce the RAM overhead by de-duplicating the data, now that it's cheap to look up the fallback chain paths because everything's cached in ram [21:47:28] so it *should* do a lot better, but i don't know for sure it'll work well enough :) [21:47:35] I would love to see us find a way to make the worst case backport about 10 minutes instead of ~60m [21:47:39] MatmaRex: it seems like the T416541 run is going to be a noop, is that unexpected? [21:47:40] T416541: Automatic global group membership is updated on unrelated local group changes - https://phabricator.wikimedia.org/T416541 [21:48:09] (it's at f* wikis and no changes so far) [21:48:32] bvibber: i expect beta to be a "harder" environment that prod since it'll experience live changes whereas prod is basically static within a pod's life. I'm mostly afk, but let me know if you're debugging something [21:48:33] the checkuser/suppress run, I mean [21:49:35] tgr_: not entirely unexpected, that global group has been active for a while [21:50:27] Krinkle: we are just talking about where things are with that work after folks hit the "why is this backport so slow!!" tripping hazard again today. 
[21:52:16] so people probably got the updated global group due to some other group changes, as Johannnes89 says here https://phabricator.wikimedia.org/T416541#11586037 [21:52:36] ok, we are at 1 user now so you are probably right [21:52:44] I'll start the other deploy then [21:52:49] we might get some updates on loginwiki, since checkuser is granted there [21:53:07] bd808: ack, I'll slide back into the bushes then [21:53:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238432 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:54:01] (03Merged) 10jenkins-bot: Configure rate limit class for local bots (and local-bot global group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238432 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:54:20] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1238432|Configure rate limit class for local bots (and local-bot global group) (T415588)]] [21:54:25] T415588: Add rate limit class for accounts that are in a local bot group on any wiki - https://phabricator.wikimedia.org/T415588 [21:55:08] (03CR) 10Btullis: "You will need to bump the chart version as well, or the change won't be published.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) (owner: 10Santiago Faci) [21:56:14] !log tgr@deploy2002 tgr, matmarex: Backport for [[gerrit:1238432|Configure rate limit class for local bots (and local-bot global group) (T415588)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:57:51] (03CR) 10Btullis: test-kitchen kubernetes chart: New config property (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) (owner: 10Santiago Faci) [22:00:05] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260223T2200). [22:00:19] !log tgr@deploy2002 tgr, matmarex: Continuing with sync [22:01:49] Hey all - would like to deploy a few security patches during the window today. Are we almost done with the backport window? [22:03:40] sbassett: yeah, just wrapping up [22:04:26] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1238432|Configure rate limit class for local bots (and local-bot global group) (T415588)]] (duration: 10m 06s) [22:04:31] T415588: Add rate limit class for accounts that are in a local bot group on any wiki - https://phabricator.wikimedia.org/T415588 [22:04:31] !log UTC late deploys done [22:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:41] !log running foreachwikiindblist sul CentralAuth:UpdateAutomaticGlobalGroupMembership --local-group=bot [22:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:06] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:14:31] (03CR) 10JHathaway: [C:03+1] Remove obsolete config override for git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/1242299 (owner: 10Muehlenhoff) [22:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.98% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:19:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:21:27] !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:23:56] (03PS1) 10Btullis: Move an HDFS journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) [22:24:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8128/console" [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:24:36] (03CR) 10Btullis: Move an HDFS journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:27:30] (03PS2) 10Btullis: Move an HDFS journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) [22:27:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8129/console" [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.58% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:31:40] !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:32:15] (03CR) 10Dzahn: [C:03+2] "ERROR:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [22:32:18] (03PS3) 10Btullis: Move an HDFS journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) [22:32:18] (03PS1) 10Btullis: Move a second journalnode to a newer host [puppet] - 10https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) [22:35:09] (03PS2) 10Dzahn: releases: upgrade Java version from 17 to 21 [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) [22:37:21] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1242483/8131/releases1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1242483 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [22:37:27] (03PS1) 10Btullis: Prepare to decom the old an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) [22:38:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:46:54] (03PS2) 10Btullis: Prepare to decom the old an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) [22:46:54] (03PS1) 10Btullis: Add the configuration for the new dse-k8s worker nodes 
that were an-worker [puppet] - 10https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) [22:49:54] (03PS1) 10Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) [22:50:49] ! Deployed security fix for T418122 [22:51:01] !log Deployed security fix for T418122 [22:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:06] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242508 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:56:11] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242511 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:56:19] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242513 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:56:27] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:56:32] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [22:59:39] (03PS2) 10Btullis: Add the configuration for the new dse-k8s worker nodes that were an-worker [puppet] - 10https://gerrit.wikimedia.org/r/1242514 (https://phabricator.wikimedia.org/T414948) [22:59:39] (03PS2) 10Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) [23:09:33] !log Deployed security fix for T416090 [23:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:41] (03CR) 10Ryan Kemper: [C:03+2] cleanup(WDQS): remove all remaining references to the WDQS LDF endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/1237148 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [23:10:50] (03CR) 10Ryan Kemper: [C:03+2] cleanup(WDQS): remove WDQS LDF endpoint from cfssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/1237147 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [23:10:53] (03CR) 10Ryan Kemper: [C:03+2] cleanup(WDQS): remove monitoring for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237146 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [23:10:56] (03CR) 10Ryan Kemper: [C:03+2] cleanup(WDQS/traffic): cleanup backend.yaml rules for WDQS LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1237145 (https://phabricator.wikimedia.org/T415696) (owner: 10Gehel) [23:14:27] (03CR) 10Cwhite: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1242292 (owner: 10Muehlenhoff) [23:16:28] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1242297 (owner: 10Muehlenhoff) [23:18:29] !log Deployed security fix for T417603 [23:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:59] (03PS1) 10Btullis: Add the new druid-internal servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) [23:24:31] (03CR) 10CI reject: [V:04-1] Add the new druid-internal servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242529 (https://phabricator.wikimedia.org/T417430) (owner: 10Btullis) [23:25:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06), 13Patch-For-Review: Q3:rack/setup/install druid-internal100[1-6] - https://phabricator.wikimedia.org/T417430#11642947 (10BTullis) [23:26:04] (03CR) 10Manjurul: [C:03+1] Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) (owner: 10Pppery) 
[23:29:04] (03PS1) 10Dwisehaupt: Fix hostname for frmx SPF records [dns] - 10https://gerrit.wikimedia.org/r/1242532 (https://phabricator.wikimedia.org/T417958) [23:38:07] (03PS1) 10Btullis: Add dbstore1010 to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1242533 (https://phabricator.wikimedia.org/T417948) [23:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:40:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06), 13Patch-For-Review: Q3:rack/setup/install dbstore1010 - https://phabricator.wikimedia.org/T417948#11642974 (10BTullis) [23:47:31] (03PS3) 10Pppery: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [23:48:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [23:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:57:45] (03PS2) 10Scott French: mesh: Copy mesh.configuration 1.15.1 to 1.15.2 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1242517 (https://phabricator.wikimedia.org/T364245) [23:57:49] (03PS2) 10Scott French: mesh: Set traffic_direction to INBOUND on local TLS listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) [23:57:52] (03PS2) 10Scott French: mesh: Copy mesh.deployment 1.3.1 to 1.3.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242519 (https://phabricator.wikimedia.org/T364245) [23:57:58] (03PS2) 10Scott French: mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) [23:58:02] (03PS2) 10Scott French: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) [23:58:05] (03PS3) 10Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245)