[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0000)
[00:00:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[00:03:01] (CR) Aklapper: [V:+2 C:+2] Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1218364 (owner: Pppery)
[00:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:10:19] (PS1) Dzahn: Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383
[00:12:19] (CR) Dzahn: [C:+2] Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383 (owner: Dzahn)
[00:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:33:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:50] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update to Envoy 1.35.7 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1217611 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[00:40:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385
[00:40:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:41:57] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[00:42:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[00:43:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:43:52] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[00:44:16] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[00:45:54] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[00:46:17] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[00:48:17] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[00:49:08] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[00:50:15] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[00:50:30] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[00:52:49] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[00:52:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:53:01] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:54] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 15s)
[01:10:05] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390
[01:10:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:10:07] SRE, collaboration-services, MW-on-K8s, serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11462618 (Dzahn) Open→Resolved file transfers to and between releases servers are now encrypted
[01:11:27] SRE, Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11462622 (Dzahn) Hi @Gnangarra what do you think? Do you just want to take over the existing Wikidebate list?
[01:13:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 35357912 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:14:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3199168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:31:13] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1208) taken on 2025-12-16 01:07:57 is 436 MiB, but the previous one was 537 MiB, a change of -18.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:34:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:55:36] (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:06:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86634 and previous config saved to /var/cache/conftool/dbconfig/20251216-020611-marostegui.json
[02:06:17] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:06:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:07:29] (Merged) jenkins-bot: SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:07:58] (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:33] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277)
[02:09:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:09:35] (CR) Ladsgroup: "With the cherry-pick, it doesn't move the message, it adds to to bottom too :/" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:45] (Abandoned) Ladsgroup: SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:11:20] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]]
[02:11:24] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86635 and previous config saved to /var/cache/conftool/dbconfig/20251216-022119-marostegui.json
[02:21:28] (PS1) Clare Ming: Update references to `product_metrics` to `test_kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906)
[02:22:16] (CR) CI reject: [V:-1] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:36:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86636 and previous config saved to /var/cache/conftool/dbconfig/20251216-023627-marostegui.json
[02:36:39] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:36:44] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:37:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[02:50:07] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] (duration: 38m 47s)
[02:50:11] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:51:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86637 and previous config saved to /var/cache/conftool/dbconfig/20251216-025136-marostegui.json
[02:51:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:51:44] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:51:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[02:52:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86638 and previous config saved to /var/cache/conftool/dbconfig/20251216-025200-marostegui.json
[02:53:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 319283440 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0300)
[03:02:03] (CR) Clare Ming: [C:-2] "need to wait until https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/226 propagates everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[03:18:15] PROBLEM - Host an-druid1005 is DOWN: PING CRITICAL - Packet loss = 20%, RTA = 2746.82 ms
[03:18:55] RECOVERY - Host an-druid1005 is UP: PING OK - Packet loss = 0%, RTA = 21.67 ms
[03:39:31] (CR) Clare Ming: "not sure if we want to update stream names with `product_metrics` in them or not" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) (owner: Clare Ming)
[03:50:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0400)
[04:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:15:46] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:47:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:48:32] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0500)
[05:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:16:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86639 and previous config saved to /var/cache/conftool/dbconfig/20251216-051607-marostegui.json
[05:16:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:16:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:35:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86640 and previous config saved to /var/cache/conftool/dbconfig/20251216-063525-marostegui.json
[06:35:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:35:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:50:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86641 and previous config saved to /var/cache/conftool/dbconfig/20251216-065033-marostegui.json
[06:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:55:39] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:58:33] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700)
[07:00:04] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700).
[07:05:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86642 and previous config saved to /var/cache/conftool/dbconfig/20251216-070542-marostegui.json
[07:10:13] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11462838 (Marostegui) p:Triage→Medium a:CDobbins I assume you'd take care of this yourself? If you need help from Clinic Duty person let me know!
[07:18:24] (PS1) Marostegui: isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472)
[07:20:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86643 and previous config saved to /var/cache/conftool/dbconfig/20251216-072049-marostegui.json
[07:20:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[07:20:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[07:21:05] (CR) Marostegui: [C:+2] isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[07:21:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:21:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86644 and previous config saved to /var/cache/conftool/dbconfig/20251216-072114-marostegui.json
[07:22:36] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11462865 (ABran-WMF) a:ABran-WMF
[07:27:55] (Abandoned) Ayounsi: interface::alias: update define to get prefix len from netmask [puppet] - https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[07:30:27] (CR) Ayounsi: [C:+1] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[07:33:08] (CR) Ayounsi: [C:+2] Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:34:58] (Merged) jenkins-bot: Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:36:54] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:37:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:47:28] (CR) Itamar Givon: [C:+1] Use relative path for "latest" symlinks [dumps] - https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: Jakob)
[07:59:43] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1215549 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0800).
[08:00:05] hamishcz and akosiaris: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86645 and previous config saved to /var/cache/conftool/dbconfig/20251216-080227-marostegui.json
[08:02:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:02:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:09:53] (PS1) Muehlenhoff: Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798)
[08:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:11:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS bookworm
[08:17:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86646 and previous config saved to /var/cache/conftool/dbconfig/20251216-081735-marostegui.json
[08:22:41] (CR) Muehlenhoff: [C:+2] Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:27:56] (CR) Ayounsi: [C:+2] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[08:29:37] (PS2) Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465)
[08:32:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86647 and previous config saved to /var/cache/conftool/dbconfig/20251216-083243-marostegui.json
[08:33:36] (PS8) Dpogorzelski: ml-build: add docker-pkg [puppet] - https://gerrit.wikimedia.org/r/1218211
[08:33:44] (CR) Dpogorzelski: ml-build: add docker-pkg (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:37:08] (PS1) Muehlenhoff: Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798)
[08:39:32] (PS1) Dpogorzelski: docker_registry: allow ml-build1001 [puppet] - https://gerrit.wikimedia.org/r/1218707 (https://phabricator.wikimedia.org/T412524)
[08:40:55] (PS1) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:41:24] (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:42:59] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11462957 (MoritzMuehlenhoff)
[08:43:11] (PS2) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:45:04] (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:45:25] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:46:34] (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:47:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86648 and previous config saved to /var/cache/conftool/dbconfig/20251216-084752-marostegui.json
[08:47:58] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:47:59] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:48:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance
[08:48:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86649 and previous config saved to /var/cache/conftool/dbconfig/20251216-084817-marostegui.json
[08:48:32] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:37] SRE, collaboration-services, Infrastructure-Foundations, Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11462983 (ayounsi)
[08:51:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86650 and previous config saved to /var/cache/conftool/dbconfig/20251216-085155-ladsgroup.json
[08:52:00] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[08:53:28] (PS1) Aqu: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990)
[08:55:16] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2002.codfw.wmnet
[08:58:05] (PS1) Jelto: interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864)
[08:58:46] (PS3) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:00:03] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463009 (MoritzMuehlenhoff)
[09:01:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:02:02] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7825/co" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:04:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:06:03] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7826/console" [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:06:03] (PS1) Muehlenhoff: Remove puppetmaster2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783)
[09:07:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86651 and previous config saved to /var/cache/conftool/dbconfig/20251216-090704-ladsgroup.json
[09:07:57] jmm@cumin2002 decommission (PID 2673345) is awaiting input
[09:12:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:12:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:12:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2002.codfw.wmnet
[09:12:43] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463027 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster2002.codfw.wmnet` - puppetmaster2002....
[09:12:53] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783) (owner: Muehlenhoff)
[09:13:51] ops-codfw, DC-Ops, decommission-hardware: decommission puppetmaster2002 - https://phabricator.wikimedia.org/T412783#11463029 (MoritzMuehlenhoff)
[09:19:11] (CR) Elukey: [C:+2] team-sre: avoid cert-expiry alerts for staging endpoints [alerts] - https://gerrit.wikimedia.org/r/1217107 (owner: Elukey)
[09:20:37] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:21:17] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:22:00] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[09:22:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86652 and previous config saved to /var/cache/conftool/dbconfig/20251216-092212-ladsgroup.json
[09:27:18] (PS2) Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892)
[09:27:28] (CR) Elukey: Pyrra: add the MWH completeness SLO under Data Platform (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: Elukey)
[09:28:18] (PS4) Jelto: gitlab: use real netmask in interface::alias on replicas [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:32:05] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:32:42] (PS2) Muehlenhoff: Remove puppetmaster1003 from active Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798)
[09:34:13] (CR) Ayounsi: [C:+1] "nicely done!" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:34:37] (CR) Ayounsi: [C:+1] "lgtm! especially as it's a NOOP for now." [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:35:10] (PS1) Fabfur: P::cache::haproxy: enable QOS for video files [puppet] - https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785)
[09:37:12] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: Fabfur)
[09:37:17] (CR) Muehlenhoff: [C:+2] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:37:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86653 and previous config saved to /var/cache/conftool/dbconfig/20251216-093720-ladsgroup.json
[09:37:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[09:37:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance
[09:37:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86654 and previous config saved to /var/cache/conftool/dbconfig/20251216-093745-ladsgroup.json
[09:39:25] (CR) Jelto: [V:+1 C:+2] interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:39:31] (CR) Jelto: [V:+1 C:+2] gitlab: use real netmask in interface::alias on replicas [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:40:05] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[09:41:02] (PS2) Fabfur: 
P::cache::haproxy: enable QOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) [09:42:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [09:43:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [09:46:17] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [09:46:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463122 (10ops-monitoring-bot) Host gitlab2002.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new... [09:47:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463124 (10ayounsi) @Jhancock.wm I'll leave it to you and @RobH to procure the needed equipment. If you prefer a fiber run between the two devi... 
[09:50:03] RESOLVED: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:52:07] (03PS1) 10Muehlenhoff: Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) [09:53:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:54:09] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [09:55:20] (03CR) 10Hashar: [C:03+2] "The API tests job failed with:" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [09:55:21] (03PS2) 10Elukey: sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) [09:55:50] (03CR) 10Elukey: "Simplified even more the code, I think that now it looks way better." [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: 10Elukey) [09:58:20] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:58:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463177 (10ops-monitoring-bot) Host gitlab1003.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new... 
[09:59:37] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:01:43] (03PS1) 10Tchanders: Add Special:GlobalContributions to no-IP reveal pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) [10:03:05] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:03:08] !log Started MediaWiki train task `train-presync`. It did not run overnight due to a CI failure | T408277 [10:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:12] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [10:03:45] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) [10:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:04:38] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:04:45] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [10:05:03] FIRING: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:07] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.7 refs T408277 [10:05:17] !log 
elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [10:08:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463212 (10Jelto) `gitlab2002` and `gitlab1003` have been fixed using the changes above. Before merging the change I manually de... [10:10:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:06] (03PS1) 10Jelto: gitlab: use real netmask in interface::alias on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) [10:15:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto) [10:15:47] (03CR) 10Jelto: [V:03+1 C:04-1] "merge after end of year break" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto) [10:21:50] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:24:32] (03PS4) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) [10:25:02] (03CR) 10Cathal Mooney: [C:03+1] "Cool, LGTM! If we roll it out for those hosts we can take a look and see the matches on the network. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:25:05] (03PS1) 10Hashar: admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725 [10:25:05] (03PS1) 10Hashar: admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726 [10:25:06] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789 (10LSobanski) 03NEW [10:26:40] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:31:56] (03PS1) 10Muehlenhoff: puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798) [10:32:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463290 (10cmooney) >>! In T410717#11463123, @ayounsi wrote: > If a copper run is fine, then it's an SFP-T (that you probably have in stock) on... 
[10:32:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: reimage [10:34:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:35:36] (03CR) 10Arnaudb: [C:03+2] admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725 (owner: 10Hashar) [10:35:47] (03CR) 10Arnaudb: [C:03+2] admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726 (owner: 10Hashar) [10:37:38] (03CR) 10Muehlenhoff: [C:03+2] puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:39:34] (03PS5) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) [10:40:03] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:42:39] (03PS1) 10Elukey: DNM - Reimage: manual stop before reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731 [10:44:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:44:48] jouncebot: nowandnext [10:44:48] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [10:44:48] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [10:45:38] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [10:45:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) 
(owner: 10Dreamy Jazz) [10:45:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz) [10:46:15] (03CR) 10Tiziano Fogli: [C:03+2] icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:46:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:46:37] (03Merged) 10jenkins-bot: Remove definition of wgGlobalBlockingEnableAutoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) (owner: 10Dreamy Jazz) [10:46:39] (03Merged) 10jenkins-bot: Show global autoblocks in the globalblocks list API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz) [10:49:44] Scap is currently being held by "concurrent prep is locked by mwpresync on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7 refs T408277"" [10:49:45] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [10:50:19] My understanding is that it normally doesn't take more than a few minutes to move testwikis to the new wiki version, so is there something delaying it? [10:51:29] Dreamy_Jazz: https://sal.toolforge.org/log/VL-dJpsBffdvpiTrGlEr [10:51:49] I am rerunning it yes [10:51:50] concurrent prep is locked by mwpresync (pid 1347261) on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7 refs T408277". 
[10:51:50] Will wait up to 10 minute(s) for the lock(s) to be released [10:52:01] I had presumed it finished [10:52:23] (or at least it wasn't actively happening because the window seemed free) [10:52:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11463359 (10MoritzMuehlenhoff) [10:52:32] it takes a couple hours to run iirc [10:52:39] err [10:52:42] at least an hour [10:53:18] I have started it with `sudo /bin/systemctl start train-presync` [10:53:34] Okay. My config patches were already merged as it seems that the command above doesn't block off scap entirely [10:53:54] I presume the spiderpig job will exit and then at some point later I'll try syncing again [10:54:02] the last entry I had in the log was images being build with output being logged to /srv/mwpresync/scap-image-build-and-push-log [10:54:26] I have been tailing that file and it is at: [10:54:26] 10:09:23 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/.. 
[10:55:20] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-3h&to=now&timezone=utc&viewPanel=panel-8 [10:55:37] it is pushing stuff oscillating between 3MB/s and 5MB/s [10:56:04] Yeah, thanks for the graph [10:56:35] the image was created 46 minutes ago and is 9.23GB [10:57:16] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [10:57:19] so there is some network bottleneck, either out of the deployment box or on ingress traffic to the image registry [10:57:59] Yeah, at the slower speed it seems about an hour using some back of the hand math [10:58:05] !log mwpresync@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/m [10:58:05] ediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/media [10:58:06] wiki-staging/scap/image-build' returned non-zero exit status 1. 
(scap version: 4.229.0) (duration: 52m 58s) [10:58:11] But that's presuming the file needs to be copied once [10:58:17] 10:58:05 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error [10:58:17] :-( [10:58:22] :( [10:58:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:58:53] DOCKER [10:59:06] (03PS1) 10Gmodena: alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) [10:59:08] I presume you are going to retry? [10:59:23] go ahead and backport your patch :] [10:59:25] (03CR) 10MVernon: "This looks plausible to me; when it comes to deployment, do we want to merge this on a depooled proxy first to check all is good, or are y" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [10:59:35] (03CR) 10CI reject: [V:04-1] alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [10:59:36] I am going to brew a coffee and will resume the train sync once you are done [10:59:45] Okay. Backporting now. Thanks [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [11:00:44] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] [11:00:49] T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086 [11:00:49] T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087 [11:01:17] (03PS2) 10Gmodena: alertmanager: onboard wikidata platform. 
[puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) [11:04:11] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table # T412040-fix-dryrun-02 [11:04:15] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [11:06:46] (03PS3) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) [11:07:36] * hashar grabs a coffee [11:10:23] k8s image build and push is taking longer than normal which is unexpected because my config patches did not affect i18n. I expect this is because the last push as part of the mwpresync failed? [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:41] I wonder if the same speed restrictions is being seen for this build? 
[11:15:29] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:00] (03CR) 10Marco Fossati: [C:03+1] Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: 10Kimberly Sarabia) [11:18:19] Dreamy_Jazz: oh yeah my bad sorry [11:18:29] I imagine scap might indeed attempt to push the images :/ [11:18:33] (03CR) 10Btullis: postgresql-airflow-main: Increase pgbouncer pool size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu) [11:18:42] I am dumb I forgot :/ [11:19:01] Yeah the build-and-push-log last has an entry at 11:02 [11:19:18] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-1h&to=now&timezone=utc&viewPanel=panel-8 [11:19:32] so yeah sorry I have passed to you the hot potatoe of pushing stuff [11:19:34] :-\ [11:19:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:19:39] Yeah, been watching that graph and seeing it do the same thing :D [11:20:01] and I could not manage to find out how to reach the logs for that `docker push` [11:20:45] It kind of feels like the maximum speed is lower than previous attempts to push [11:21:35] (03Abandoned) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 
10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli) [11:22:57] https://grafana.wikimedia.org/goto/cGn-4EGDR?orgId=1 shows to me that last weeks presync went much faster (assuming that is what the activity at 04:30 is) [11:22:58] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [11:29:56] (03PS1) 10Elukey: admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737 [11:32:03] the slow times may be related to pushing the image layers to swift, we should really start trying the ceph-based backend for /restricted [11:32:04] Dreamy_Jazz: it usually takes 45 minutes based on https://sal.toolforge.org/production?p=0&q=%22Finished+scap+sync-world%3A+testwikis%22&d= [11:32:31] but it will need more tests, so something not immediate :( [11:33:11] Thanks for the context. I have time to wait and monitor this proceed [11:35:04] elukey@cumin1003 reimage (PID 1159643) is awaiting input [11:39:43] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [11:40:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:41:53] Flurry of activity in `/var/lib/spiderpig/scap-image-build-and-push-log` [11:42:22] The push-and-build completed successfully, it's now on to the sync-masters step [11:43:07] jouncebot: nowandnext [11:43:07] For the next 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [11:43:07] In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300) [11:43:36] (03PS1) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 
10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [11:43:39] sync-master is going slower than normal, likely because it needs to copy more data, like an i18n backport [11:44:31] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [11:44:44] (03CR) 10Cathal Mooney: "Thanks Matthew. I'm 99% sure it'll "Just Work Fine"TM. But similarly if it's easy to depool a host and apply it there first I'd say let'" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [11:46:28] (03PS2) 10Ayounsi: [WIP] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549)