[00:00:07] (PS2) E75ti: install_server: add Broadcom NIC UEFI check [puppet] - https://gerrit.wikimedia.org/r/1217340 (https://phabricator.wikimedia.org/T411374)
[00:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:16:11] !log rzl@apt1002:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.35.7-1_amd64.deb # T410975
[00:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:14] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[00:16:43] !log rzl@apt1002:~$ sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy # T410975
[00:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:49] !log rzl@apt1002:~$ sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy # T410975
[00:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:30:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:40:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217343
[00:40:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217343 (owner: TrainBranchBot)
[00:53:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1217343 (owner: TrainBranchBot)
[00:57:44] (PS1) RLazarus: envoy: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1217344 (https://phabricator.wikimedia.org/T410975)
[01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:02:31] (PS2) RLazarus: envoy: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1217344 (https://phabricator.wikimedia.org/T410975)
[01:10:19] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217346
[01:10:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217346 (owner: TrainBranchBot)
[01:10:56] (CR) RLazarus: [V:+2] "`" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1217344 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[01:23:18] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 22m 36s)
[01:29:47] (PS1) RLazarus: mw-*: Upgrade to Envoy 1.35.7 in the MW canary releases and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1217347 (https://phabricator.wikimedia.org/T410975)
[01:34:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1217346 (owner: TrainBranchBot)
[02:27:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDo
[02:29:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:32:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:34:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:06:13] (PS1) DLynch: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217352
[03:10:59] (CR) Anzx: [C:+1] "works fine when i tested logo update for T411850" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[03:14:03] (PS1) DLynch: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217354
[03:14:46] (PS2) DLynch: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217352
[03:15:12] (Abandoned) DLynch: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217354 (owner: DLynch)
[03:23:39] (CR) Stang: [C:+1] Logos: Destandardize thumbnail sizes, handle missing responsive URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[04:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:21:29] (PS1) Clare Ming: Deploy TestKitchen to Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806)
[04:23:14] (CR) CI reject: [V:-1] Deploy TestKitchen to Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[04:28:37] (PS2) Clare Ming: Deploy TestKitchen to Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806)
[05:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:46:43] (CR) Ayounsi: [C:+1] Nokia ESI-LAG: Adjust module to fully remove when last LAG deleted [homer/public] - https://gerrit.wikimedia.org/r/1217270 (owner: Cathal Mooney)
[06:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T0700)
[07:00:05] marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T0700).
[08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T0800). nyaa~
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:15:15] SRE, Infrastructure-Foundations: nodesource node20 apt mirror is broken - https://phabricator.wikimedia.org/T412342 (Jelto) NEW
[08:20:02] (PS1) Jelto: aptrepo: Disable node20/bookworm nodesource mirror [puppet] - https://gerrit.wikimedia.org/r/1217456 (https://phabricator.wikimedia.org/T412342)
[08:21:15] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7810/console" [puppet] - https://gerrit.wikimedia.org/r/1217456 (https://phabricator.wikimedia.org/T412342) (owner: Jelto)
[08:29:32] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11450415 (jcrespo)
[08:33:45] SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11450416 (jcrespo)
[08:45:09] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11450426 (jcrespo) >>! In T411883#11449463, @Dzahn wrote: > The user should also be added to LDAP groups "nda" (now that it's signed)...
[08:45:14] (PS1) Dpogorzelski: ml-build: add docker [puppet] - https://gerrit.wikimedia.org/r/1217460
[08:46:09] (CR) Scott French: [C:+1] envoy: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1217344 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[08:46:19] (CR) Scott French: [C:+1] "Looks good for canaries, but there's no change to mw-debug. Did you mean to include that?" [deployment-charts] - https://gerrit.wikimedia.org/r/1217347 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[08:49:10] (CR) Ayounsi: install_server: add Broadcom NIC UEFI check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1217340 (https://phabricator.wikimedia.org/T411374) (owner: E75ti)
[08:49:29] SRE, Infrastructure-Foundations, Patch-For-Review: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11450435 (ayounsi) @e75ti thanks for giving it a try, I think a better approach would be to add that behavior to the [[ https://gerrit.wikimedia.org/r/pl...
[08:50:20] (PS1) Brouberol: Replace temporary opensearch_test connection by staging/prod OS-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238)
[08:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:53:04] (PS1) Jcrespo: admin: Add Leif_WMDE access to the analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1217462 (https://phabricator.wikimedia.org/T411883)
[08:56:42] (PS2) Brouberol: Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238)
[09:01:14] (CR) Kosta Harlan: [C:+1] Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[09:01:45] (CR) Elukey: [C:+1] aptrepo: Disable node20/bookworm nodesource mirror [puppet] - https://gerrit.wikimedia.org/r/1217456 (https://phabricator.wikimedia.org/T412342) (owner: Jelto)
[09:02:02] (CR) Brouberol: Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[09:06:11] (PS3) Brouberol: Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238)
[09:06:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:07:49] (CR) Brouberol: Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[09:07:58] (CR) CI reject: [V:-1] Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[09:08:42] (PS4) Brouberol: Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238)
[09:08:43] (CR) Slyngshede: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1217462 (https://phabricator.wikimedia.org/T411883) (owner: Jcrespo)
[09:08:48] (CR) Elukey: aptrepo: Disable node20/bookworm nodesource mirror [puppet] - https://gerrit.wikimedia.org/r/1217456 (https://phabricator.wikimedia.org/T412342) (owner: Jelto)
[09:10:09] (PS1) Elukey: aptrepo: fix node20 updates config [puppet] - https://gerrit.wikimedia.org/r/1217465
[09:11:20] (CR) Brouberol: [C:+2] Replace temporary opensearch_test connection by DC-scoped opensearch-ipoid [deployment-charts] - https://gerrit.wikimedia.org/r/1217461 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[09:11:56] (CR) Elukey: [C:+2] aptrepo: fix node20 updates config [puppet] - https://gerrit.wikimedia.org/r/1217465 (owner: Elukey)
[09:13:48] (CR) Jelto: "thank you! let's give this a try" [puppet] - https://gerrit.wikimedia.org/r/1217465 (owner: Elukey)
[09:14:04] (PS3) Gehel: WDQS: introduce a new role to test Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235)
[09:14:29] (CR) Gehel: WDQS: introduce a new role to test Blazegraph alternatives (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: Gehel)
[09:15:41] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: Gehel)
[09:20:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:20:34] (CR) Jcrespo: [C:+2] admin: Add Leif_WMDE access to the analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1217462 (https://phabricator.wikimedia.org/T411883) (owner: Jcrespo)
[09:21:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:21:26] (PS4) Gehel: WDQS: introduce a new role to test Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235)
[09:21:42] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: Gehel)
[09:21:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:24:01] (Abandoned) Elukey: aptrepo: Disable node20/bookworm nodesource mirror [puppet] - https://gerrit.wikimedia.org/r/1217456 (https://phabricator.wikimedia.org/T412342) (owner: Jelto)
[09:28:28] (CR) Klausman: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7811/console" [puppet] - https://gerrit.wikimedia.org/r/1217460 (owner: Dpogorzelski)
[09:28:50] (PS2) Dpogorzelski: ml-build: add docker [puppet] - https://gerrit.wikimedia.org/r/1217460
[09:29:19] (CR) CI reject: [V:-1] ml-build: add docker [puppet] - https://gerrit.wikimedia.org/r/1217460 (owner: Dpogorzelski)
[09:30:13] (CR) Klausman: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7812/console" [puppet] - https://gerrit.wikimedia.org/r/1217460 (owner: Dpogorzelski)
[09:31:19] (PS3) Dpogorzelski: ml-build: add docker [puppet] - https://gerrit.wikimedia.org/r/1217460
[09:34:47] SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11450550 (Aklapper) For the records I had a similar issue running the train last month in T408272#11369913; however simply trying to run `scap train` again worke...
[09:36:06] (CR) Cathal Mooney: [C:+2] Nokia ESI-LAG: Adjust module to fully remove when last LAG deleted [homer/public] - https://gerrit.wikimedia.org/r/1217270 (owner: Cathal Mooney)
[09:37:35] (Merged) jenkins-bot: Nokia ESI-LAG: Adjust module to fully remove when last LAG deleted [homer/public] - https://gerrit.wikimedia.org/r/1217270 (owner: Cathal Mooney)
[09:37:43] (PS1) Slyngshede: C:external_clouds_vendors add WMCS [puppet] - https://gerrit.wikimedia.org/r/1217466 (https://phabricator.wikimedia.org/T411503)
[09:41:47] !log revert eqsin transport load balancing
[09:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:54] (CR) Dpogorzelski: [C:+2] ml-build: add docker [puppet] - https://gerrit.wikimedia.org/r/1217460 (owner: Dpogorzelski)
[10:00:22] !log revert esams transport load balancing
[10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:16] (PS1) Btullis: Allow wdqs101[8-9] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351)
[10:06:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:07:07] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7816/co" [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[10:11:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:13:07] (PS1) Jelto: sre.gitlab.upgrade: update background migration check [cookbooks] - https://gerrit.wikimedia.org/r/1217474 (https://phabricator.wikimedia.org/T412276)
[10:14:48] (PS2) Slyngshede: P:cache::haproxy mark requests from WMCS as trusted [puppet] - https://gerrit.wikimedia.org/r/1217466 (https://phabricator.wikimedia.org/T411503)
[10:18:52] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11450702 (jcrespo) Open→Resolved a:jcrespo Access has been merged and deployed @Leif_WMDE please test it and reopen if...
[10:19:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[10:19:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[10:20:25] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11450707 (jcrespo) Open→Stalled There is nothing else to do here for clinic duty until user gets back to us.
[10:21:33] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11450712 (jcrespo) @SEgt-WMF any update?
[10:21:40] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11450713 (jcrespo) a:andrea.denisse→None
[10:23:44] (CR) Arnaudb: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1217474 (https://phabricator.wikimedia.org/T412276) (owner: Jelto)
[10:25:06] (CR) Fabfur: [C:+1] P:cache::haproxy mark requests from WMCS as trusted [puppet] - https://gerrit.wikimedia.org/r/1217466 (https://phabricator.wikimedia.org/T411503) (owner: Slyngshede)
[10:26:04] (CR) Btullis: [C:+1] WDQS: introduce a new role to test Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: Gehel)
[10:26:12] (PS5) Gehel: WDQS: introduce a new role to test Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235)
[10:26:23] (CR) Jelto: [C:+2] "thank you for double checking!" [cookbooks] - https://gerrit.wikimedia.org/r/1217474 (https://phabricator.wikimedia.org/T412276) (owner: Jelto)
[10:27:13] (CR) Gehel: [C:+2] WDQS: introduce a new role to test Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1217238 (https://phabricator.wikimedia.org/T412235) (owner: Gehel)
[10:30:12] (PS2) Btullis: Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351)
[10:31:00] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7817/co" [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[10:31:23] (Merged) jenkins-bot: sre.gitlab.upgrade: update background migration check [cookbooks] - https://gerrit.wikimedia.org/r/1217474 (https://phabricator.wikimedia.org/T412276) (owner: Jelto)
[10:32:50] (PS3) Btullis: Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351)
[10:37:06] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[10:37:25] !log jelto@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[10:54:52] (PS1) Jelto: sre.gitlab.upgrade: fix index out of range error in background migration check [cookbooks] - https://gerrit.wikimedia.org/r/1217481 (https://phabricator.wikimedia.org/T412276)
[10:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:57:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[10:57:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1100)
[11:00:26] (PS1) Brouberol: airflow-platform-eng: restore the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1217482 (https://phabricator.wikimedia.org/T408238)
[11:01:59] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:05:29] !log jelto@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:12:10] (PS2) Jelto: sre.gitlab.upgrade: fix index out of range error in background migration check [cookbooks] - https://gerrit.wikimedia.org/r/1217481 (https://phabricator.wikimedia.org/T412276)
[11:12:29] (PS1) Elukey: aptrepo: update HP's gpg configuration [puppet] - https://gerrit.wikimedia.org/r/1217483
[11:27:23] (PS1) Santiago Faci: wmfuniq_experiment_fetcher: Update TestKitchen API domain [puppet] - https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805)
[11:29:11] (CR) CI reject: [V:-1] wmfuniq_experiment_fetcher: Update TestKitchen API domain [puppet] - https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[11:32:43] (PS2) Santiago Faci: wmfuniq_experiment_fetcher: Update TestKitchen API domain [puppet] - https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805)
[11:34:35] (CR) Jelto: [C:-1] "From reading https://downloads.linux.hpe.com/SDR/keys.html it looks like `26C2B797` and `74C3A4A2` are still active keys, `B1275EA3` is ex" [puppet] - https://gerrit.wikimedia.org/r/1217483 (owner: Elukey)
[11:34:54] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:40:26] (CR) Brouberol: [C:+2] airflow-platform-eng: restore the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1217482 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:42:13] (PS1) Brouberol: airflow-platform-eng: enable egress to dse-k8s-codfw HTTPS ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1217488 (https://phabricator.wikimedia.org/T408238)
[11:43:34] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[11:44:09] (CR) Jelto: [V:+2 C:+2] "verified with `test-cookbook`" [cookbooks] - https://gerrit.wikimedia.org/r/1217481 (https://phabricator.wikimedia.org/T412276) (owner: Jelto)
[11:48:01] (CR) Vgutierrez: [C:+1] "endpoint tested from a CDN node, looking good." [puppet] - https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[11:49:36] (CR) Kosta Harlan: [C:+1] airflow-platform-eng: enable egress to dse-k8s-codfw HTTPS ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1217488 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:50:13] (Merged) jenkins-bot: sre.gitlab.upgrade: fix index out of range error in background migration check [cookbooks] - https://gerrit.wikimedia.org/r/1217481 (https://phabricator.wikimedia.org/T412276) (owner: Jelto)
[11:54:11] (CR) Brouberol: [C:+2] airflow-platform-eng: enable egress to dse-k8s-codfw HTTPS ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1217488 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:54:57] ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412376 (phaultfinder) NEW
[11:55:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:55:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:57:30] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[12:06:25] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[12:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:15:44] (PS1) Btullis: Update the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1217495 (https://phabricator.wikimedia.org/T410017)
[12:16:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:17:36] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:18:52] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:19:21] (03CR) 10Btullis: [C:03+2] Update the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217495 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:21:02] (03Merged) 10jenkins-bot: Update the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217495 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:21:36] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217497 (owner: 10L10n-bot) [12:24:45] (03PS1) 10Dpogorzelski: ml-build: specify missing defaults [puppet] - 10https://gerrit.wikimedia.org/r/1217500 [12:24:55] (03CR) 10Dpogorzelski: [C:03+2] ml-build: specify missing defaults [puppet] - 10https://gerrit.wikimedia.org/r/1217500 (owner: 10Dpogorzelski) [12:25:32] (03PS1) 10Btullis: Bump spark-support chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217501 [12:26:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:28:19] (03PS2) 10Federico Ceratto: prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric [puppet] - 10https://gerrit.wikimedia.org/r/1217492 (https://phabricator.wikimedia.org/T384810) [12:28:39] (03CR) 10Scott French: [C:03+1] "Thanks for adding this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [12:30:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1025.eqiad.wmnet with OS bullseye [12:30:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11451307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host aqs1025.eqiad.wmnet with OS bullseye [12:32:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:32:50] (03PS4) 10Blake: service: add exclude_from_switchover field. [puppet] - 10https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) [12:33:06] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:34:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:36:28] (03CR) 10Btullis: [C:03+2] Bump spark-support chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217501 (owner: 10Btullis) [12:36:41] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:37:47] (03CR) 10Blake: [C:03+2] service: add exclude_from_switchover field. 
[puppet] - https://gerrit.wikimedia.org/r/1217189 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[12:38:22] (Merged) jenkins-bot: Bump spark-support chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1217501 (owner: Btullis)
[12:40:30] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1025.eqiad.wmnet with reason: host reimage
[12:44:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1025.eqiad.wmnet with reason: host reimage
[12:46:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:48:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:48:32] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:50:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11451381 (Jclark-ctr) a:Jclark-ctr
[12:50:42] SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11451401 (Urbanecm_WMF) p:Unbreak!→High Decreasing to //High//, as several deployments completed since. I'll try again. Leaving open in case #serviceops...
[12:52:31] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[12:52:51] !log jelto@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[12:53:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[12:53:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[12:53:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[12:54:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd1007.eqiad.wmnet with OS bookworm
[12:54:24] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451407 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host logging-sd1007.eqiad.wmnet with OS bookworm
[12:54:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd1006.eqiad.wmnet with OS bookworm
[12:54:37] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451408 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host logging-sd1006.eqiad.wmnet with OS bookworm
[12:59:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:59:26] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by
cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:59:36] (PS1) Daniel Kinzler: rest gateway: split anon class [deployment-charts] - https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379)
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1300)
[13:00:09] (CR) Santiago Faci: "I would say this change doesn't depend on https://gerrit.wikimedia.org/r/q/I296204582c32d052796fad92fc65acf2d0ba7732. Test Kitchen API as " [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[13:00:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[13:00:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1025.eqiad.wmnet with OS bullseye
[13:00:30] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11451451 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host aqs1025.eqiad.wmnet with OS bullseye completed: - aqs1025 (**PASS**) -...
[13:00:58] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host logging-sd1005.eqiad.wmnet with OS bookworm
[13:01:05] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451454 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host logging-sd1005.eqiad.wmnet with OS bookworm
[13:02:31] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11451466 (Jclark-ctr)
[13:02:36] (CR) Santiago Faci: "Also, keep in mind that there is another change (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1214585) where Test Kitche" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[13:02:38] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11451467 (Jclark-ctr) Open→Resolved
[13:03:57] ops-eqiad, SRE, DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11451470 (Jclark-ctr) @eevans the replacement server installation has been finished T407032
[13:10:07] (PS4) Gehel: Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[13:10:20] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[13:11:16] (PS2) Elukey: aptrepo: update HP's gpg configuration [puppet] - https://gerrit.wikimedia.org/r/1217483
[13:12:24] (PS1) Blake: service: add exclude_from_switchover field.
[software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211)
[13:12:50] (CR) Elukey: "You are totally right and I am glad that I asked for a sanity check, I trusted the GPG key without hesitation even if changing an existing" [puppet] - https://gerrit.wikimedia.org/r/1217483 (owner: Elukey)
[13:13:04] (CR) CI reject: [V:-1] service: add exclude_from_switchover field. [software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[13:15:24] (CR) Elukey: "Ok this one has the same signature as the one removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1217278 but different expiry" [puppet] - https://gerrit.wikimedia.org/r/1217483 (owner: Elukey)
[13:15:55] (PS2) Blake: service: add exclude_from_switchover field. [software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211)
[13:17:45] !log krinkle@deploy2002 Started deploy [performance/navtiming@dde77b9]: Add temporary group for parsoid readviews
[13:17:55] !log krinkle@deploy2002 Finished deploy [performance/navtiming@dde77b9]: Add temporary group for parsoid readviews (duration: 00m 16s)
[13:20:26] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[13:20:48] cscott: ihurbain: ^
[13:21:29] (PS3) Anzx: niawiktionary: update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850)
[13:21:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850) (owner: Anzx)
[13:22:45] (CR) CI reject: [V:-1] service: add exclude_from_switchover field.
[software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[13:25:09] (PS1) Mforns: Adjust page-analytics values to access the data-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1217523 (https://phabricator.wikimedia.org/T405041)
[13:30:12] (PS3) Blake: service: add exclude_from_switchover field. [software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211)
[13:32:29] jclark@cumin1003 provision (PID 3511216) is awaiting input
[13:34:44] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11451533 (RobH)
[13:35:30] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11451534 (RobH) a:RobH→Jclark-ctr lvs1018's links were removed from use yesterday, so this project is now on the steps: [] [john or valerie] move a...
[13:38:29] (CR) CI reject: [V:-1] service: add exclude_from_switchover field.
[software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[13:40:49] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1005.eqiad.wmnet with reason: host reimage
[13:43:04] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1007.eqiad.wmnet with reason: host reimage
[13:43:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd1006.eqiad.wmnet with reason: host reimage
[13:43:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1005.eqiad.wmnet with reason: host reimage
[13:44:05] (CR) Btullis: [C:+1] Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[13:44:38] (CR) Elukey: [C:+2] aptrepo: update HP's gpg configuration [puppet] - https://gerrit.wikimedia.org/r/1217483 (owner: Elukey)
[13:46:19] (CR) Btullis: [C:+2] Adjust page-analytics values to access the data-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1217523 (https://phabricator.wikimedia.org/T405041) (owner: Mforns)
[13:46:44] (PS5) Gehel: Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[13:47:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1006.eqiad.wmnet with reason: host reimage
[13:47:24] (CR) Gehel: [C:+2] Allow wdqs10[28-32] to mount the NFS dumps directories [puppet] - https://gerrit.wikimedia.org/r/1217473 (https://phabricator.wikimedia.org/T412351) (owner: Btullis)
[13:47:59] (CR) Jelto: [C:+1] "looks good to me now, thank you" [puppet] - https://gerrit.wikimedia.org/r/1217483 (owner: Elukey)
[13:48:05] (Merged)
jenkins-bot: Adjust page-analytics values to access the data-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1217523 (https://phabricator.wikimedia.org/T405041) (owner: Mforns)
[13:50:23] !log restart gnmic on netflow1002
[13:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd1007.eqiad.wmnet with reason: host reimage
[13:53:06] (PS4) Blake: service: add exclude_from_switchover field. [software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211)
[13:55:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217352 (owner: DLynch)
[13:58:37] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[13:58:41] SRE, Infrastructure-Foundations, SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#11451599 (JAllemandou) Hi folks, as part of this ticket I lost access to logstash. Can one of you please add me to the `cn=logstash...
[13:58:51] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[14:00:04] Urbanecm and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1400).
[14:00:05] aude, anzx, and Kemayo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] o/
[14:00:14] i am here
[14:00:28] o/
[14:00:35] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:00:47] ops-eqiad, DC-Ops: Power Supply - PS1 Status - issue on cirrussearch1119:9290 - https://phabricator.wikimedia.org/T412404 (phaultfinder) NEW
[14:01:29] (PS1) Gehel: dumps NFS: fixed typo in server names [puppet] - https://gerrit.wikimedia.org/r/1217530 (https://phabricator.wikimedia.org/T412351)
[14:02:02] (CR) Btullis: [C:+1] dumps NFS: fixed typo in server names [puppet] - https://gerrit.wikimedia.org/r/1217530 (https://phabricator.wikimedia.org/T412351) (owner: Gehel)
[14:02:14] (CR) Gehel: [C:+2] dumps NFS: fixed typo in server names [puppet] - https://gerrit.wikimedia.org/r/1217530 (https://phabricator.wikimedia.org/T412351) (owner: Gehel)
[14:02:53] (PS1) Btullis: Update the spark image that is deployed to analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217531 (https://phabricator.wikimedia.org/T410017)
[14:03:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:03:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1005.eqiad.wmnet with OS bookworm
[14:03:54] Mine’s not testable until a config change happens later, but is going to take forever because it’ll need to rebuild localization stuff, so feel free to bundle it in with others or leave it for me to do at the end.
[14:04:02] (CR) Blake: "Sorry for the early review request - today I learned about the autoformatting tool!"
[software/spicerack] - https://gerrit.wikimedia.org/r/1217520 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[14:04:03] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host logging-sd1005.eqiad.wmnet with OS bookworm completed: - logg...
[14:04:19] SRE, Infrastructure-Foundations, SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#11451617 (Novem_Linguae) I think applying for logstash access is now done by visiting https://idm.wikimedia.org/permissions/ and fi...
[14:04:21] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:04:23] i could deploy mine but not confident enough to +2 and deploy the logo changes
[14:04:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:04:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1006.eqiad.wmnet with OS bookworm
[14:04:57] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451618 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host logging-sd1006.eqiad.wmnet with OS bookworm completed: - logg...
[14:05:17] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[14:05:33] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[14:05:38] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[14:05:50] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[14:06:35] aude: if none of other deployers are available, i can reschedule it for next window
[14:06:45] (PS1) Ayounsi: GNMI: disable healtz for Nokia [puppet] - https://gerrit.wikimedia.org/r/1217532
[14:06:59] ok sorry I am not confident enough
[14:07:04] i am newish to deploying stuff
[14:07:18] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[14:07:21] (CR) Btullis: [C:+2] Update the spark image that is deployed to analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217531 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:07:21] i will deploy mine, should be quick
[14:07:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:07:51] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[14:08:06] (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: LorenMora)
[14:08:28] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:08:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd1007.eqiad.wmnet with OS bookworm
[14:08:37] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451620 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host logging-sd1007.eqiad.wmnet with OS bookworm completed: - logg...
[14:08:57] (Merged) jenkins-bot: [Legal Footer] Deploy Legal Footer for Phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) (owner: LorenMora)
[14:09:14] (Merged) jenkins-bot: Update the spark image that is deployed to analytics-test [deployment-charts] - https://gerrit.wikimedia.org/r/1217531 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:09:17] (CR) Cathal Mooney: [C:+1] GNMI: disable healtz for Nokia [puppet] - https://gerrit.wikimedia.org/r/1217532 (owner: Ayounsi)
[14:09:22] I guess a config patch that includes images that (might?) need to get copied somewhere else is an unusual case, yeah.
[14:09:30] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1216839|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T410164)]]
[14:09:31] (PS2) Ayounsi: GNMI: disable healtz for Nokia [puppet] - https://gerrit.wikimedia.org/r/1217532
[14:09:33] T410164: [Legal Footer] Turn on wmgUseLegalFooterContactLink config for phase 1 wikis - https://phabricator.wikimedia.org/T410164
[14:09:43] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1217532 (owner: Ayounsi)
[14:11:36] (CR) Ayounsi: [C:+2] GNMI: disable healtz for Nokia [puppet] - https://gerrit.wikimedia.org/r/1217532 (owner: Ayounsi)
[14:11:48] !log aude@deploy2002 lmora, aude: Backport for [[gerrit:1216839|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T410164)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:12:57] checking
[14:14:15] ops-eqiad, DC-Ops: Power Supply - PS1 Status - issue on cirrussearch1119:9290 - https://phabricator.wikimedia.org/T412404#11451663 (Jclark-ctr) Open→Resolved a:Jclark-ctr Was working in same rack powercable was loose on pdu. Reseated
[14:16:45] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412376#11451674 (Jclark-ctr) Open→Resolved a:Jclark-ctr updated limits per what is pending on T407628
[14:16:56] !log aude@deploy2002 lmora, aude: Continuing with sync
[14:17:13] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[14:17:52] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451680 (Jclark-ctr)
[14:17:58] ops-eqiad, SRE, DC-Ops, Observability-Logging: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11451681 (Jclark-ctr) Open→Resolved
[14:20:13] jelto@cumin1003 upgrade (PID 3509200) is awaiting input
[14:21:01] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216839|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T410164)]] (duration: 11m 32s)
[14:21:05] T410164: [Legal Footer] Turn on wmgUseLegalFooterContactLink config for phase 1 wikis - https://phabricator.wikimedia.org/T410164
[14:21:47] I'll get mine now.
[14:21:54] done with mine
[14:22:06] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217352 (owner: DLynch)
[14:23:34] (Merged) jenkins-bot: Localisation updates from https://translatewiki.net.
[extensions/VisualEditor] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1217352 (owner: DLynch)
[14:23:55] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1217352|Localisation updates from https://translatewiki.net.]]
[14:25:19] (PS1) Btullis: Update the kerberos settings for the hive thriftserver [deployment-charts] - https://gerrit.wikimedia.org/r/1217535 (https://phabricator.wikimedia.org/T410017)
[14:28:02] (CR) Btullis: [C:+2] Update the kerberos settings for the hive thriftserver [deployment-charts] - https://gerrit.wikimedia.org/r/1217535 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:28:05] gehel@cumin1003 reimage (PID 3523474) is awaiting input
[14:28:37] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1217352|Localisation updates from https://translatewiki.net.]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:30:04] !log kemayo@deploy2002 kemayo: Continuing with sync
[14:30:21] (Merged) jenkins-bot: Update the kerberos settings for the hive thriftserver [deployment-charts] - https://gerrit.wikimedia.org/r/1217535 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:33:23] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie
[14:33:43] gmodena: ^
[14:33:44] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[14:34:39] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217352|Localisation updates from https://translatewiki.net.]] (duration: 10m 44s)
[14:37:06] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[14:37:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[14:43:20] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[14:48:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:48:34] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:48:37] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:50:52] !log gehel@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1028.eqiad.wmnet with reason: host reimage
[14:52:02] jclark@cumin1003 provision (PID 3524683) is awaiting input
[14:52:02] jclark@cumin1003 provision (PID 3524692) is awaiting input
[14:52:53] jclark@cumin1003 provision (PID 3524698) is awaiting input
[14:54:45] (PS1) Cwhite: logstash: move logstash-ml logs to hdd-class nodes after 7d [puppet] - https://gerrit.wikimedia.org/r/1217539 (https://phabricator.wikimedia.org/T390215)
[14:54:59] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1028.eqiad.wmnet with reason: host reimage
[14:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:59:06] (CR) Cwhite: [C:+2] logstash: move logstash-ml logs to hdd-class nodes after 7d [puppet] - https://gerrit.wikimedia.org/r/1217539 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite)
[15:01:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:02:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:05:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE
(2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11451821 (Jclark-ctr)
[15:06:15] (PS3) Cwhite: loki: increase chunk flush interval [puppet] - https://gerrit.wikimedia.org/r/1069301 (https://phabricator.wikimedia.org/T335610)
[15:10:35] (CR) Cwhite: [C:+2] loki: increase chunk flush interval [puppet] - https://gerrit.wikimedia.org/r/1069301 (https://phabricator.wikimedia.org/T335610) (owner: Cwhite)
[15:12:24] gehel ack. Thanks!
[15:13:17] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1028.eqiad.wmnet with OS trixie
[15:15:42] (PS1) AOkoth: collab: add vrts junk queue alert [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632)
[15:18:26] (CR) Cwhite: [C:+1] "Looks good!" [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:21:56] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[15:24:16] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[15:25:10] (CR) Dzahn: "shouldn't the junk queue always be low (or zero) on the inactive host though? so checking both sites should not cause an alert?"
[alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:28:39] (CR) AOkoth: "The exporter is disabled on the secondary host plus the junk queue size will be the same either way since this metric is pulled from the s" [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1530)
[15:36:12] (PS1) Btullis: Fix the keytab path for spark-support [deployment-charts] - https://gerrit.wikimedia.org/r/1217554 (https://phabricator.wikimedia.org/T410017)
[15:36:21] (CR) AOkoth: "Scratch that... The exporter is running but cannot scrape due to the database config using a different port." [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:36:33] (CR) CI reject: [V:-1] Fix the keytab path for spark-support [deployment-charts] - https://gerrit.wikimedia.org/r/1217554 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[15:38:17] (PS2) Btullis: Fix the keytab path for spark-support [deployment-charts] - https://gerrit.wikimedia.org/r/1217554 (https://phabricator.wikimedia.org/T410017)
[15:39:09] !log gehel@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[15:40:21] (CR) Btullis: [C:+2] Fix the keytab path for spark-support [deployment-charts] - https://gerrit.wikimedia.org/r/1217554 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[15:40:27] (CR) AOkoth: "It's coming back to me now...
The exporter did not support setting a port on the mysql connection string (it failed to parse it if you set" [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:41:24] !log gehel@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[15:41:45] (CR) AOkoth: "https://github.com/justwatchcom/sql_exporter/issues/40" [alerts] - https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[15:42:13] (Merged) jenkins-bot: Fix the keytab path for spark-support [deployment-charts] - https://gerrit.wikimedia.org/r/1217554 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[15:43:37] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[15:43:43] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[15:43:58] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[15:47:12] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[15:58:54] (CR) Dr0ptp4kt: [C:+1] wmfuniq_experiment_fetcher: Update TestKitchen API domain [puppet] - https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[16:00:05] Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1600)
[16:03:36] (PS1) DDesouza: Partially undeploy 2025 Global Readers Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1217557 (https://phabricator.wikimedia.org/T410918)
[16:05:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217557 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [16:07:48] (03CR) 10Vgutierrez: [C:03+2] wmfuniq_experiment_fetcher: Update TestKitchen API domain [puppet] - 10https://gerrit.wikimedia.org/r/1217487 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [16:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:22:57] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217558 [16:23:47] !log upload new package of corto via reprepro [16:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:00] (03PS1) 10Jcrespo: installserver: Prepare future backup hosts to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1217562 [16:40:49] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sat 27 Dec 2025 04:40:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [16:41:00] 06SRE, 06Infrastructure-Foundations, 06SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#11452071 (10JAllemandou) Thanks so much @Novem_Linguae , I confirm the link worked for me. 
[16:53:33] (03PS1) 10Joal: Revert webrequest related datasets retention [puppet] - 10https://gerrit.wikimedia.org/r/1217563 (https://phabricator.wikimedia.org/T412321) [16:56:33] jouncebot: nowandnext [16:56:33] For the next 0 hour(s) and 3 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1600) [16:56:33] In 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1700) [17:00:05] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1700) [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:05:28] (03CR) 10Bking: [C:03+1] "Noting per IRC conversation that if these hosts will use UEFI, you'll need to make a new partman recipe." [puppet] - 10https://gerrit.wikimedia.org/r/1217562 (owner: 10Jcrespo) [17:10:47] (03CR) 10Jcrespo: [C:03+2] installserver: Prepare future backup hosts to be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1217562 (owner: 10Jcrespo) [17:11:20] (03CR) 10Jcrespo: [C:03+2] "Understood, and will add that to the notes for my colleagues, hopefully I will be back to work on that before the install happens." [puppet] - 10https://gerrit.wikimedia.org/r/1217562 (owner: 10Jcrespo) [17:15:57] (03CR) 10Cwhite: [C:03+2] scap: update beta-logs logstash host to use svc record [puppet] - 10https://gerrit.wikimedia.org/r/1208048 (https://phabricator.wikimedia.org/T409363) (owner: 10Cwhite) [17:17:35] !log gehel@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [17:18:36] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#11452225 (10Dzahn) a:05Dzahn→03No... 
[17:20:25] !log gehel@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1030.eqiad.wmnet with OS trixie [17:27:27] (03PS1) 10Btullis: Allow the spark serviceaccount to perform more actions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217568 (https://phabricator.wikimedia.org/T410017) [17:27:40] (03PS2) 10Btullis: Allow the spark serviceaccount to perform more actions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217568 (https://phabricator.wikimedia.org/T410017) [17:33:30] (03CR) 10Btullis: [C:03+2] Allow the spark serviceaccount to perform more actions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217568 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [17:34:55] (03Merged) 10jenkins-bot: Allow the spark serviceaccount to perform more actions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217568 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [17:37:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [17:37:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [17:44:38] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy: Update to v1.35.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1217344 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:48:15] (03PS2) 10RLazarus: mw-*: Upgrade to Envoy 1.35.7 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217347 (https://phabricator.wikimedia.org/T410975) [17:59:45] (03PS6) 10Xcollazo: Update dumps mirror Hieradata to reflect Scatter's new hostname and IP address [puppet] - 10https://gerrit.wikimedia.org/r/1216652 (https://phabricator.wikimedia.org/T409006) (owner: 10Harej) [18:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T1800) [18:01:31] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1216652 (https://phabricator.wikimedia.org/T409006) (owner: 10Harej) [18:02:31] (03CR) 10RLazarus: [C:03+2] "Sure did, thanks for spotting it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217347 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [18:05:15] (03Merged) 10jenkins-bot: mw-*: Upgrade to Envoy 1.35.7 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217347 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [18:07:02] (03CR) 10Xcollazo: [C:03+1] "PPC run looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1216652 (https://phabricator.wikimedia.org/T409006) (owner: 10Harej) [18:35:04] (03PS2) 10DDesouza: Undeploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217557 (https://phabricator.wikimedia.org/T410918) [18:43:02] belated o/ for the MW infra window, scapping some envoy updates [18:44:09] (03PS1) 10Jasmine: charts: add Sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 [18:45:40] (03CR) 10CI reject: [V:04-1] charts: add Sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 (owner: 10Jasmine) [18:47:30] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1217347 T410975 [18:47:34] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [18:48:30] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1217347 T410975 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [18:48:31] (03PS2) 10Jasmine: charts: add Sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 [18:50:10] !log rzl@deploy2002 rzl: Continuing with sync [18:51:48] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1217347 T410975 (duration: 04m 54s) [18:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:01:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [19:01:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [19:02:47] !incidents [19:02:48] 7144 (UNACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [19:03:13] !ack 7144 [19:03:13] 7144 (ACKED) TransitPeeringTransportOutboundSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [19:06:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[19:06:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [19:13:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11452636 (10VRiley-WMF) Updated DNS name for E11 and E12 as that is the suspected problem [19:14:25] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11452637 (10RobH) [19:15:23] !log krinkle@deploy1002 sql --write wikifunctionswiki `UPDATE page SET page_touched='20251211191600' WHERE page_id=66102 LIMIT 1;` [19:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:25] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [19:25:53] (03PS1) 10Dzahn: releases: add stunnel to rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1217572 [19:26:23] (03CR) 10CI reject: [V:04-1] releases: add stunnel to rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1217572 (owner: 10Dzahn) [19:26:57] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-worker2004-5 to codfw - jhancock@cumin1003" [19:27:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-worker2004-5 to codfw - jhancock@cumin1003" [19:27:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:27:17] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2004 [19:27:18] !log jhancock@cumin1003 START - 
Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2005 [19:27:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2004 [19:27:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2005 [19:28:59] Krinkle: curious, what was the UPDATE for? [19:28:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:29:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:30:24] (03PS1) 10Urbanecm: Revert^2 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217573 (https://phabricator.wikimedia.org/T411526) [19:30:37] (03PS1) 10Urbanecm: Revert^2 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217574 (https://phabricator.wikimedia.org/T411526) [19:30:41] jouncebot: nowandnext [19:30:41] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [19:30:41] In 1 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T2100) [19:30:50] (03CR) 10Urbanecm: [C:03+2] Revert^2 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217573 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [19:30:54] (03CR) 10Urbanecm: [C:03+2] Revert^2 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217574 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [19:33:04] (03PS1) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1217575 (https://phabricator.wikimedia.org/T410469) [19:33:53] (03CR) 10CI reject: [V:04-1] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217575 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [19:34:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:34:45] (03PS2) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217575 (https://phabricator.wikimedia.org/T410469) [19:36:05] (03CR) 10Urbanecm: [C:03+2] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217575 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [19:36:54] (03Merged) 10jenkins-bot: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217575 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [19:37:32] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1217575|[Growth] Enable Add Link backend on a handful of wikis (T410469)]] [19:37:35] T410469: Add a Link: Rollout "Add a Link" task to remaining Wikipedias that have V2 model support but don't yet have access to "Add a Link" - https://phabricator.wikimedia.org/T410469 [19:40:09] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1217575|[Growth] Enable Add Link backend on a handful of wikis (T410469)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:40:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:41:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:13] !log urbanecm@deploy2002 urbanecm: Continuing with sync [19:43:51] (03Merged) 10jenkins-bot: Revert^2 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217573 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [19:43:56] (03Merged) 10jenkins-bot: Revert^2 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217574 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [19:44:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:46:27] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217575|[Growth] Enable Add Link backend on a handful of wikis (T410469)]] (duration: 08m 55s) [19:46:31] T410469: Add a Link: Rollout "Add a Link" task to remaining Wikipedias that have V2 model support but don't yet have access to "Add a Link" - https://phabricator.wikimedia.org/T410469 [19:47:12] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1217573|Revert^2 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1217574|Revert^2 "i18n: replace <> to avoid false positive export errors" (T411526)]] [19:47:16] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [19:47:23] (03PS1) 10Clare Ming: Test Kitchen: StickyHeaders experiment hotfix [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 
10https://gerrit.wikimedia.org/r/1217576 (https://phabricator.wikimedia.org/T412146) [19:47:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217576 (https://phabricator.wikimedia.org/T412146) (owner: 10Clare Ming) [19:48:03] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:13] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-jumbo1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:49:21] (03PS3) 10Jasmine: charts: add Sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 [19:51:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1001.eqiad.wmnet with OS trixie [19:52:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11452745 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1... [19:54:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti-jumbo1002.eqiad.wmnet with OS trixie [19:54:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11452754 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1... 
[19:55:32] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11452756 (10ATitkov) Here are the replies I got back: - wikipedia25.org and www.wikipedia25.org should redirect to https://wikimediafoundati... [20:02:00] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage [20:05:34] (03PS2) 10Dzahn: releases: add stunnel to rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1217572 [20:08:16] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1217572/7818/releases1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1217572 (owner: 10Dzahn) [20:08:57] (03PS3) 10Dzahn: releases: add stunnel to rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1217572 (https://phabricator.wikimedia.org/T289858) [20:09:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-jumbo1001.eqiad.wmnet with reason: host reimage [20:10:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:10:59] !log gehel@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [20:15:27] (03CR) 10Eric Gardner: [C:03+1] Test Kitchen: StickyHeaders experiment hotfix [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217576 (https://phabricator.wikimedia.org/T412146) (owner: 10Clare Ming) [20:25:18] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:28:22] jclark@cumin1003 reimage (PID 3604593) is awaiting input 
[20:28:29] !log gehel@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [20:28:51] (03PS11) 10Daniel Kinzler: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 [20:34:25] !log gehel@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [20:35:10] (03CR) 10Dzahn: [C:03+2] releases: add stunnel to rsync data copy [puppet] - 10https://gerrit.wikimedia.org/r/1217572 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [20:37:12] (03CR) 10BCornwall: [C:03+2] wikimediafoundation.org: Add AAAA record [dns] - 10https://gerrit.wikimedia.org/r/1217268 (https://phabricator.wikimedia.org/T403269) (owner: 10BCornwall) [20:37:47] !log brett@dns1006 START - running authdns-update [20:38:47] !log brett@dns1006 END - running authdns-update [20:40:53] !log urbanecm@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.4,1.46.0-wmf.5,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/me [20:40:53] diawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediaw [20:40:53] iki-staging/scap/image-build' returned non-zero exit status 1. 
(scap version: 4.229.0) (duration: 53m 41s) [20:41:06] ááááá [20:41:30] rzl: i hate to inform you it probably _is_ patch specific :/ [20:43:03] urbanecm: in that case I think this is probably a question for releng [20:43:13] there's some expertise in serviceops but it's all in Lisbon right now [20:43:36] yeah, makes sense. well, my next attempt is going to be on Monday the earliest anyway [20:43:45] shouldn't do deployments on a friday [20:44:56] (03PS1) 10Xcollazo: Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217579 [20:45:39] (03PS2) 10Xcollazo: Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217579 (https://phabricator.wikimedia.org/T411803) [20:45:50] well, let's revert it again :/ [20:46:35] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11452914 (10Urbanecm_WMF) Okay, I tried deploying the two patches (https://gerrit.wikimedia.org/r/1217573, https://gerrit.wikimedia.org/r/1217574) again. Same erro... [20:46:55] rzl: should i tag the task with anything else too? [20:47:19] that too is a question for releng :) [20:47:31] xcollazo: do you need a puppet merge or you are good? [20:47:36] (03PS1) 10Urbanecm: Revert^3 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217580 (https://phabricator.wikimedia.org/T411526) [20:47:48] (03PS1) 10Urbanecm: Revert^3 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217581 (https://phabricator.wikimedia.org/T411526) [20:47:52] (03CR) 10TChin: [C:03+1] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1217579 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [20:48:00] (03CR) 10Urbanecm: [V:03+2 C:03+2] Revert^3 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217580 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [20:48:03] (03CR) 10Urbanecm: [V:03+2 C:03+2] Revert^3 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217581 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [20:49:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1217580|Revert^3 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1217581|Revert^3 "i18n: replace <> to avoid false positive export errors" (T411526)]] [20:49:04] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [20:50:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:50:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-jumbo1001.eqiad.wmnet with OS trixie [20:50:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11452923 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1001.... [20:51:07] mutante: I could use a +2 on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1217579, yes, thanks [20:51:46] (03CR) 10TChin: [C:03+2] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1217579 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [20:53:27] (03Merged) 10jenkins-bot: Scale up mw-content-history-reconcile-enrich temporarily for big reconcile. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217579 (https://phabricator.wikimedia.org/T411803) (owner: 10Xcollazo) [20:54:03] xcollazo: Oh, I had this in mind: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1216652 that is another repo that would also need deployment [20:54:18] that = the patch in deployment-charts (vs a puppet change) [20:55:04] mutante: ah, yes, if that one looks good to you, please +2 [20:55:53] (03PS1) 10Majavah: wikimediafoundation.org: Add AAAA for non-apex records as well [dns] - 10https://gerrit.wikimedia.org/r/1217582 (https://phabricator.wikimedia.org/T403269) [20:56:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-jumbo1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:57:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-jumbo1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:00:03] (03CR) 10Dzahn: [C:03+2] Update dumps mirror Hieradata to reflect Scatter's new hostname and IP address [puppet] - 10https://gerrit.wikimedia.org/r/1216652 (https://phabricator.wikimedia.org/T409006) (owner: 10Harej) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T2100). Please do the needful. [21:00:05] danisztls, anzx, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:12] o/ [21:00:20] o/ [21:00:23] I can self-deploy [21:00:31] * urbanecm is currently running scap [21:00:38] yep, no problem [21:00:57] xcollazo: done now (got a phone call) [21:00:57] anzx: i can deploy for you if you need a deployer [21:01:09] xcollazo: the other one I would prefer to leave for others [21:01:28] i attempted to re-deploy sth, and i ran into T412265 again. currently blank-deploying the reverts again, to get everything back to sync [21:01:29] T412265: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265 [21:01:42] xcollazo: in another matter.. earlier people asked on other channels about wikidata dumps. maybe you can take a look? it was like this one but a different user: https://phabricator.wikimedia.org/T412428 [21:01:54] not sure how long that'll take, but given it's doing a build, it might run through the window :/ [21:01:55] like the wikidata data dumps stopped generating [21:02:06] eek [21:02:27] (apologies, but scaps taking 70 minutes are VERY hard to plan around) [21:02:58] no worries - will we still be able to deploy our changes after you're done urbanecm? [21:03:32] mutante: it is generating, see https://phabricator.wikimedia.org/T412428#11452808, but seems like it is not showing up on CloudVPS instances [21:03:46] cjming: i think so, but i can't guarantee when it'll finish :/ [21:04:01] xcollazo: aha! gotcha! yea. in that case probably needs wmcs to check. makes sense [21:04:53] urbanecm: ack - thanks for the heads up [21:05:09] mutante: do you know who is a good contact to ping from wmcs? 
[21:07:58] xcollazo: just start talking on the -cloud channel and if there is no response then use the !help bot trigger to get the attention [21:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:10] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [21:09:29] 10ops-codfw, 10ops-eqiad, 06DC-Ops: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458 (10Jhancock.wm) 03NEW [21:10:27] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [21:12:11] (03CR) 10Dzahn: [C:03+2] "2025.12.11 21:11:33 LOG3[0]: s_connect: connect 2620:0:860:102:10:192:16:72:1873: Connection refused (111)" [puppet] - 10https://gerrit.wikimedia.org/r/1217572 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [21:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:14] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti-jumbo1002.eqiad.wmnet with OS trixie [21:14:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11453003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host 
ganeti-jumbo1002.... [21:17:21] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [21:17:32] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [21:18:28] mutante: got it, thanks [21:21:38] cjming: build has finished, so ~10 more minutes should be enough [21:22:06] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1217580|Revert^3 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1217581|Revert^3 "i18n: replace <> to avoid false positive export errors" (T411526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:22:10] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [21:22:28] !log urbanecm@deploy2002 urbanecm: Continuing with sync [21:22:34] urbanecm: awesome - thanks [21:25:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11453052 (10VRiley-WMF) a:03VRiley-WMF [21:26:55] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11453061 (10Dzahn) [21:31:08] (03CR) 10Xcollazo: [C:03+1] Revert webrequest related datasets retention [puppet] - 10https://gerrit.wikimedia.org/r/1217563 (https://phabricator.wikimedia.org/T412321) (owner: 10Joal) [21:31:16] cjming: if you want I can deploy yours together with mine to save us time [21:32:32] danisztls: that would be fantastic if you don't mind [21:32:54] cjming: great! 
[21:35:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217580|Revert^3 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1217581|Revert^3 "i18n: replace <> to avoid false positive export errors" (T411526)]] (duration: 46m 04s) [21:35:08] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [21:35:16] and here we go, back to normal [21:35:18] danisztls: over to you [21:35:23] (or whoever is deploying) [21:35:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217557 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [21:35:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217576 (https://phabricator.wikimedia.org/T412146) (owner: 10Clare Ming) [21:35:41] urbanecm: thanks! [21:35:48] ty! [21:37:47] (03Merged) 10jenkins-bot: Undeploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217557 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [21:40:50] merging is super slow today [21:42:00] danisztls: for a backport? [21:42:24] I'd say ~7 mins is quite reasonable for wmf. backports [21:43:36] i've waited for merges to take 22+ minutes -- those are painful [21:43:47] urbanecm: I mean just the merge, not the build. 
[21:44:52] i know, but i'm not seeing anything odd, that's all [21:44:56] ok [21:44:58] no worries [21:45:05] anyway, should be here any sec [21:45:29] (03Merged) 10jenkins-bot: Test Kitchen: StickyHeaders experiment hotfix [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217576 (https://phabricator.wikimedia.org/T412146) (owner: 10Clare Ming) [21:45:48] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1217557|Undeploy 2025 Global Readers Survey (T410918)]], [[gerrit:1217576|Test Kitchen: StickyHeaders experiment hotfix (T412146)]] [21:45:53] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:45:54] T412146: Launch Mobile Expanded Sections on non-English wikis - https://phabricator.wikimedia.org/T412146 [21:47:47] !log dani@deploy2002 dani, cjming: Backport for [[gerrit:1217557|Undeploy 2025 Global Readers Survey (T410918)]], [[gerrit:1217576|Test Kitchen: StickyHeaders experiment hotfix (T412146)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:48:21] danisztls: gtg from my side [21:48:22] cjming: it's available for testing [21:48:27] cjming: thanks [21:49:09] !log dani@deploy2002 dani, cjming: Continuing with sync [21:52:45] !log gehel@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1031.eqiad.wmnet with OS trixie [21:54:55] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217557|Undeploy 2025 Global Readers Survey (T410918)]], [[gerrit:1217576|Test Kitchen: StickyHeaders experiment hotfix (T412146)]] (duration: 09m 07s) [21:55:00] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:55:01] T412146: Launch Mobile Expanded Sections on non-English wikis - https://phabricator.wikimedia.org/T412146 [21:55:15] all done [21:55:22] tysm! 
[21:55:55] anzx: if you're around, happy to deploy your config patches for you - just lmk [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251211T2200) [22:05:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2004.codfw.wmnet with OS bookworm [22:05:30] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453193 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dse-k8s-worker2004.codfw.wmnet with OS bookworm [22:05:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2005.codfw.wmnet with OS bookworm [22:05:43] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453194 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dse-k8s-worker2005.codfw.wmnet with OS bookworm [22:05:57] hi I would like to do a security deployment [22:06:42] maryum: hi ! i think the backport window is done - all yours [22:06:49] awesome thanks [22:14:44] (03PS1) 10Dzahn: releases: use stunnel with rsync from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1217594 [22:16:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2004.codfw.wmnet with reason: host reimage [22:19:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2004.codfw.wmnet with reason: host reimage [22:21:51] had a scap failure, going to run scap again [22:22:26] maryum: I want to do a small backport to beta cluster. Let me know when I can have the conch! 
[22:22:35] I definitely will Jdlrobson [22:29:06] scap just finished [22:29:55] Jdlrobson: finished with scap [22:30:03] !log Deployed security fix for T411305 [22:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:53] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:40:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:40:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2004.codfw.wmnet with OS bookworm [22:40:27] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453242 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dse-k8s-worker2004.codfw.wmnet with OS bookworm completed: - ds... [22:41:06] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453245 (10Jhancock.wm) [22:41:52] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453247 (10Jhancock.wm) need to check the dac in the morning. no media detected on reimage. 
whomp [22:43:06] (03PS1) 10Bking: opensearch-on-k8s: enable certificate hot reloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217597 (https://phabricator.wikimedia.org/T412447) [22:43:52] (03CR) 10Ryan Kemper: [C:03+1] opensearch-on-k8s: enable certificate hot reloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217597 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [22:44:41] (03CR) 10CI reject: [V:04-1] opensearch-on-k8s: enable certificate hot reloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217597 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [22:45:55] (03PS2) 10Bking: opensearch-on-k8s: enable certificate hot reloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217597 (https://phabricator.wikimedia.org/T412447) [22:48:15] (03CR) 10Bking: [C:03+2] opensearch-on-k8s: enable certificate hot reloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217597 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [22:49:13] maryum: doing this now! Thanks! 
[22:49:32] (03PS1) 10Jdlrobson: Enable MinervaPersonalMenu on beta for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217595 (https://phabricator.wikimedia.org/T404227) [22:49:47] (03PS2) 10Jdlrobson: Enable MinervaPersonalMenu on beta for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217595 (https://phabricator.wikimedia.org/T404227) [22:50:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217595 (https://phabricator.wikimedia.org/T404227) (owner: 10Jdlrobson) [22:51:21] (03Merged) 10jenkins-bot: Enable MinervaPersonalMenu on beta for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217595 (https://phabricator.wikimedia.org/T404227) (owner: 10Jdlrobson) [22:52:30] and done :) [22:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:09:46] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:12:20] (03PS1) 10Bking: opensearch-on-k8s: Enable Reload Certificates API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217599 (https://phabricator.wikimedia.org/T412447) [23:13:10] (03CR) 10Ryan Kemper: [C:03+1] opensearch-on-k8s: Enable Reload Certificates API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217599 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [23:14:14] (03PS1) 10DLynch: Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217600 (https://phabricator.wikimedia.org/T405177) [23:15:11] (03CR) 10Bking: 
[C:03+2] opensearch-on-k8s: Enable Reload Certificates API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217599 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [23:16:34] (03CR) 10Bearloga: [C:03+2] Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217600 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [23:16:55] (03Merged) 10jenkins-bot: opensearch-on-k8s: Enable Reload Certificates API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217599 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [23:18:03] (03CR) 10Bearloga: [C:03+1] Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217600 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [23:18:33] Web window is all done, so I'm going to deploy that config patch. [23:18:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217600 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [23:19:42] (03Merged) 10jenkins-bot: Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217600 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [23:20:03] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1217600|Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames (T405177 T410803)]] [23:20:08] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [23:20:08] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [23:21:58] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1217600|Add product_metrics.contributors.experiments to 
wgMetricsPlatformExperimentStreamNames (T405177 T410803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:23:37] !log kemayo@deploy2002 kemayo: Continuing with sync [23:25:52] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2005.codfw.wmnet with OS bookworm [23:26:05] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11453333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dse-k8s-worker2005.codfw.wmnet with OS bookworm executed with e... [23:27:41] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217600|Add product_metrics.contributors.experiments to wgMetricsPlatformExperimentStreamNames (T405177 T410803)]] (duration: 07m 38s) [23:27:46] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [23:27:47] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [23:29:18] @Kemayo are you still deploying? I have a small follow up that's beta only [23:29:30] Jdlrobson: I'm done now. 
[23:29:40] thanks I just need to +2 it should be 5 m [23:29:54] (03PS1) 10Jdlrobson: Correct syntax for wgMinervaPersonalMenu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217604 (https://phabricator.wikimedia.org/T404227) [23:30:25] (03PS1) 10Bking: opensearch-on-k8s: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217606 (https://phabricator.wikimedia.org/T412447) [23:30:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217604 (https://phabricator.wikimedia.org/T404227) (owner: 10Jdlrobson) [23:31:24] (03Merged) 10jenkins-bot: Correct syntax for wgMinervaPersonalMenu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217604 (https://phabricator.wikimedia.org/T404227) (owner: 10Jdlrobson) [23:31:54] (done) [23:33:31] (03CR) 10Ryan Kemper: [C:03+1] opensearch-on-k8s: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217606 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [23:52:10] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11453352 (10Dzahn) a:03Dzahn [23:56:20] (03PS2) 10Dzahn: releases: use stunnel with rsync from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) [23:58:08] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1217594/7819/" [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn)