[00:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0000)
[00:00:46] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[00:03:01] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1218364 (owner: Pppery)
[00:10:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:10:19] <wikibugs>	 (PS1) Dzahn: Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383
[00:12:19] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383 (owner: Dzahn)
[00:22:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:27:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:33:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:50] <wikibugs>	 (CR) RLazarus: [C:+2] {api,rest}-gateway: Update to Envoy 1.35.7 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1217611 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[00:40:06] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385
[00:40:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:41:57] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[00:42:13] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[00:43:17] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:43:52] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[00:44:16] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[00:45:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[00:46:17] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[00:48:17] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:52] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[00:49:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[00:50:15] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[00:50:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[00:52:49] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[00:52:50] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:53:01] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[01:00:39] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:54] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 15s)
[01:10:05] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390
[01:10:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:10:07] <wikibugs>	 SRE, collaboration-services, MW-on-K8s, serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11462618 (Dzahn) Open→Resolved file transfers to and between releases servers are now encrypted
[01:11:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11462622 (Dzahn) Hi @Gnangarra what do you think? Do you just want to take over the existing Wikidebate list?
[01:13:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 35357912 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:14:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3199168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:31:13] <icinga-wm>	 PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1208) taken on 2025-12-16 01:07:57 is 436 MiB, but the previous one was 537 MiB, a change of -18.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:34:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:55:36] <wikibugs>	 (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:06:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86634 and previous config saved to /var/cache/conftool/dbconfig/20251216-020611-marostegui.json
[02:06:17] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:06:18] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:07:29] <wikibugs>	 (Merged) jenkins-bot: SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:07:58] <wikibugs>	 (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277)
[02:09:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:09:35] <wikibugs>	 (CR) Ladsgroup: "With the cherry-pick, it doesn't move the message, it adds to to bottom too :/" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:45] <wikibugs>	 (Abandoned) Ladsgroup: SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:11:20] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]]
[02:11:24] <stashbot>	 T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:21:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86635 and previous config saved to /var/cache/conftool/dbconfig/20251216-022119-marostegui.json
[02:21:28] <wikibugs>	 (PS1) Clare Ming: Update references to `product_metrics` to `test_kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906)
[02:22:16] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:27:02] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:36:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86636 and previous config saved to /var/cache/conftool/dbconfig/20251216-023627-marostegui.json
[02:36:39] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:36:44] <stashbot>	 T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:37:31] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[02:50:07] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] (duration: 38m 47s)
[02:50:11] <stashbot>	 T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:51:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86637 and previous config saved to /var/cache/conftool/dbconfig/20251216-025136-marostegui.json
[02:51:42] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:51:44] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:51:52] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[02:52:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86638 and previous config saved to /var/cache/conftool/dbconfig/20251216-025200-marostegui.json
[02:53:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 319283440 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0300)
[03:02:03] <wikibugs>	 (CR) Clare Ming: [C:-2] "need to wait until https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/226 propagates everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[03:18:15] <icinga-wm>	 PROBLEM - Host an-druid1005 is DOWN: PING CRITICAL - Packet loss = 20%, RTA = 2746.82 ms
[03:18:55] <icinga-wm>	 RECOVERY - Host an-druid1005 is UP: PING OK - Packet loss = 0%, RTA = 21.67 ms
[03:39:31] <wikibugs>	 (CR) Clare Ming: "not sure if we want to update stream names with `product_metrics` in them or not" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) (owner: Clare Ming)
[03:50:46] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0400)
[04:10:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:15:46] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:47:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:48:32] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0500)
[05:10:03] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:59] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:16:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86639 and previous config saved to /var/cache/conftool/dbconfig/20251216-051607-marostegui.json
[05:16:13] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:16:13] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:35:03] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:27:02] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:35:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86640 and previous config saved to /var/cache/conftool/dbconfig/20251216-063525-marostegui.json
[06:35:31] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:35:32] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:50:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86641 and previous config saved to /var/cache/conftool/dbconfig/20251216-065033-marostegui.json
[06:55:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:55:39] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:58:33] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700)
[07:00:04] <jouncebot>	 marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700).
[07:05:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86642 and previous config saved to /var/cache/conftool/dbconfig/20251216-070542-marostegui.json
[07:10:13] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11462838 (Marostegui) p:Triage→Medium a:CDobbins I assume you'd take care of this yourself? If you need help from Clinic Duty person let me know!
[07:18:24] <wikibugs>	 (PS1) Marostegui: isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472)
[07:20:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86643 and previous config saved to /var/cache/conftool/dbconfig/20251216-072049-marostegui.json
[07:20:55] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[07:20:55] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[07:21:05] <wikibugs>	 (CR) Marostegui: [C:+2] isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[07:21:06] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:21:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86644 and previous config saved to /var/cache/conftool/dbconfig/20251216-072114-marostegui.json
[07:22:36] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11462865 (ABran-WMF) a:ABran-WMF
[07:27:55] <wikibugs>	 (Abandoned) Ayounsi: interface::alias: update define to get prefix len from netmask [puppet] - https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[07:30:27] <wikibugs>	 (CR) Ayounsi: [C:+1] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[07:33:08] <wikibugs>	 (CR) Ayounsi: [C:+2] Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:34:58] <wikibugs>	 (Merged) jenkins-bot: Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:36:54] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:37:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:47:28] <wikibugs>	 (CR) Itamar Givon: [C:+1] Use relative path for "latest" symlinks [dumps] - https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: Jakob)
[07:59:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1215549 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0800).
[08:00:05] <jouncebot>	 hamishcz and akosiaris: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86645 and previous config saved to /var/cache/conftool/dbconfig/20251216-080227-marostegui.json
[08:02:33] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:02:34] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:09:53] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798)
[08:10:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:11:59] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS bookworm
[08:17:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86646 and previous config saved to /var/cache/conftool/dbconfig/20251216-081735-marostegui.json
[08:22:41] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:27:56] <wikibugs>	 (CR) Ayounsi: [C:+2] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[08:29:37] <wikibugs>	 (PS2) Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465)
[08:32:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86647 and previous config saved to /var/cache/conftool/dbconfig/20251216-083243-marostegui.json
[08:33:36] <wikibugs>	 (PS8) Dpogorzelski: ml-build: add docker-pkg [puppet] - https://gerrit.wikimedia.org/r/1218211
[08:33:44] <wikibugs>	 (CR) Dpogorzelski: ml-build: add docker-pkg (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:37:08] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798)
[08:39:32] <wikibugs>	 (PS1) Dpogorzelski: docker_registry: allow ml-build1001 [puppet] - https://gerrit.wikimedia.org/r/1218707 (https://phabricator.wikimedia.org/T412524)
[08:40:55] <wikibugs>	 (PS1) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:41:24] <wikibugs>	 (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:42:59] <wikibugs>	 SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11462957 (MoritzMuehlenhoff)
[08:43:11] <wikibugs>	 (PS2) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:45:04] <wikibugs>	 (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:45:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:46:34] <wikibugs>	 (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:47:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86648 and previous config saved to /var/cache/conftool/dbconfig/20251216-084752-marostegui.json
[08:47:58] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:47:59] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:48:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance
[08:48:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86649 and previous config saved to /var/cache/conftool/dbconfig/20251216-084817-marostegui.json
[08:48:32] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:37] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11462983 (ayounsi)
[08:51:56] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86650 and previous config saved to /var/cache/conftool/dbconfig/20251216-085155-ladsgroup.json
[08:52:00] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[08:53:28] <wikibugs>	 (PS1) Aqu: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990)
[08:55:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2002.codfw.wmnet
[08:58:05] <wikibugs>	 (PS1) Jelto: interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864)
[08:58:46] <wikibugs>	 (PS3) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:00:03] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463009 (MoritzMuehlenhoff)
[09:01:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:02:02] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7825/co" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:04:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:06:03] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7826/console" [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:06:03] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetmaster2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783)
[09:07:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86651 and previous config saved to /var/cache/conftool/dbconfig/20251216-090704-ladsgroup.json
[09:07:57] <logmsgbot>	 jmm@cumin2002 decommission (PID 2673345) is awaiting input
[09:12:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:12:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:12:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2002.codfw.wmnet
[09:12:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463027 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster2002.codfw.wmnet` - puppetmaster2002....
[09:12:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster2002 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783) (owner: 10Muehlenhoff)
[09:13:51] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission puppetmaster2002 - https://phabricator.wikimedia.org/T412783#11463029 (10MoritzMuehlenhoff)
[09:19:11] <wikibugs>	 (03CR) 10Elukey: [C:03+2] team-sre: avoid cert-expiry alerts for staging endpoints [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey)
[09:20:37] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:21:17] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:22:00] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[09:22:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86652 and previous config saved to /var/cache/conftool/dbconfig/20251216-092212-ladsgroup.json
[09:27:18] <wikibugs>	 (03PS2) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892)
[09:27:28] <wikibugs>	 (03CR) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey)
[09:28:18] <wikibugs>	 (03PS4) 10Jelto: gitlab: use real netmask in interface::alias on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:32:05] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto)
[09:32:42] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798)
[09:34:13] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto)
[09:34:37] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm! especially as it's a NOOP for now." [puppet] - 10https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: 10Jelto)
[09:35:10] <wikibugs>	 (03PS1) 10Fabfur: P::cache::haproxy: enable QOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785)
[09:37:12] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[09:37:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:37:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86653 and previous config saved to /var/cache/conftool/dbconfig/20251216-093720-ladsgroup.json
[09:37:25] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[09:37:38] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance
[09:37:46] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86654 and previous config saved to /var/cache/conftool/dbconfig/20251216-093745-ladsgroup.json
[09:39:25] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - 10https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: 10Jelto)
[09:39:31] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: use real netmask in interface::alias on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto)
[09:40:05] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[09:41:02] <wikibugs>	 (03PS2) 10Fabfur: P::cache::haproxy: enable QOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785)
[09:42:01] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[09:43:58] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[09:46:17] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[09:46:48] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463122 (10ops-monitoring-bot) Host gitlab2002.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new...
[09:47:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463124 (10ayounsi) @Jhancock.wm I'll leave it to you and @RobH to procure the needed equipment.  If you prefer a fiber run between the two devi...
[09:50:03] <jinxer-wm>	 RESOLVED: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:52:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798)
[09:53:58] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:54:09] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[09:55:20] <wikibugs>	 (03CR) 10Hashar: [C:03+2] "The API tests job failed with:" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot)
[09:55:21] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458)
[09:55:50] <wikibugs>	 (03CR) 10Elukey: "Simplified even more the code, I think that now it looks way better." [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: 10Elukey)
[09:58:20] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[09:58:53] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463177 (10ops-monitoring-bot) Host gitlab1003.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new...
[09:59:37] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot)
[10:01:43] <wikibugs>	 (03PS1) 10Tchanders: Add Special:GlobalContributions to no-IP reveal pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530)
[10:03:05] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[10:03:08] <hashar>	 !log Started MediaWiki train task `train-presync`. It did not run overnight due to a CI failure | T408277
[10:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:12] <stashbot>	 T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277
[10:03:45] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277)
[10:03:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot)
[10:04:38] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot)
[10:04:45] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[10:05:03] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:05:07] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.7  refs T408277
[10:05:17] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie
[10:08:26] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463212 (10Jelto) `gitlab2002` and `gitlab1003` have been fixed using the changes above. Before merging the change I manually de...
[10:10:03] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:12:06] <wikibugs>	 (03PS1) 10Jelto: gitlab: use real netmask in interface::alias on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018)
[10:15:03] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto)
[10:15:47] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:04-1] "merge after end of year break" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto)
[10:21:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:24:32] <wikibugs>	 (03PS4) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625)
[10:25:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "Cool, LGTM!  If we roll it out for those hosts we can take a look and see the matches on the network.  Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[10:25:05] <wikibugs>	 (03PS1) 10Hashar: admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725
[10:25:05] <wikibugs>	 (03PS1) 10Hashar: admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726
[10:25:06] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789 (10LSobanski) 03NEW
[10:26:40] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[10:27:02] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:31:56] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798)
[10:32:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463290 (10cmooney) >>! In T410717#11463123, @ayounsi wrote: > If a copper run is fine, then it's an SFP-T (that you probably have in stock) on...
[10:32:29] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: reimage
[10:34:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[10:35:36] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725 (owner: 10Hashar)
[10:35:47] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726 (owner: 10Hashar)
[10:37:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:39:34] <wikibugs>	 (03PS5) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625)
[10:40:03] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[10:42:39] <wikibugs>	 (03PS1) 10Elukey: DNM - Reimage: manual stop before reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731
[10:44:20] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[10:44:48] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:44:48] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 15 minute(s)
[10:44:48] <jouncebot>	 In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100)
[10:45:38] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie
[10:45:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) (owner: 10Dreamy Jazz)
[10:45:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz)
[10:46:15] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[10:46:17] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[10:46:37] <wikibugs>	 (03Merged) 10jenkins-bot: Remove definition of wgGlobalBlockingEnableAutoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) (owner: 10Dreamy Jazz)
[10:46:39] <wikibugs>	 (03Merged) 10jenkins-bot: Show global autoblocks in the globalblocks list API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz)
[10:49:44] <Dreamy_Jazz>	 Scap is currently being held by "concurrent prep is locked by mwpresync on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7  refs T408277""
[10:49:45] <stashbot>	 T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277
[10:50:19] <Dreamy_Jazz>	 My understanding is that it normally doesn't take more than a few minutes to move testwikis to the new wiki version, so is there something delaying it?
[10:51:29] <taavi>	 Dreamy_Jazz: https://sal.toolforge.org/log/VL-dJpsBffdvpiTrGlEr
[10:51:49] <hashar>	 I am rerunning it yes
[10:51:50] <hashar>	 concurrent prep is locked by mwpresync (pid 1347261) on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7  refs T408277".
[10:51:50] <hashar>	 Will wait up to 10 minute(s) for the lock(s) to be released
[10:52:01] <Dreamy_Jazz>	 I had presumed it finished
[10:52:23] <Dreamy_Jazz>	 (or at least it wasn't actively happening because the window seemed free)
[10:52:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11463359 (10MoritzMuehlenhoff)
[10:52:32] <hashar>	 it takes a couple hours to run iirc
[10:52:39] <hashar>	 err
[10:52:42] <hashar>	 at least an hour
[10:53:18] <hashar>	 I have started it with `sudo /bin/systemctl start train-presync`
[10:53:34] <Dreamy_Jazz>	 Okay. My config patches were already merged as it seems that the command above doesn't block off scap entirely
[10:53:54] <Dreamy_Jazz>	 I presume the spiderpig job will exit and then at some point later I'll try syncing again
[10:54:02] <hashar>	 the last entry I had in the log was images being built with output being logged to /srv/mwpresync/scap-image-build-and-push-log
[10:54:26] <hashar>	 I have been tailing that file and it is at:
[10:54:26] <hashar>	 10:09:23 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/..
[10:55:20] <hashar>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-3h&to=now&timezone=utc&viewPanel=panel-8
[10:55:37] <hashar>	 it is pushing stuff oscillating between 3MB/s and 5MB/s
[10:56:04] <Dreamy_Jazz>	 Yeah, thanks for the graph
[10:56:35] <hashar>	 the image was created 46 minutes ago and is 9.23GB
[10:57:16] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie
[10:57:19] <hashar>	 so there is some network bottleneck, either outbound from the deployment box or on ingress traffic to the image registry
[10:57:59] <Dreamy_Jazz>	 Yeah, at the slower speed it seems about an hour using some back-of-the-envelope math
[10:58:05] <logmsgbot>	 !log mwpresync@deploy2002 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 52m 58s)
[10:58:11] <Dreamy_Jazz>	 But that's presuming the file needs to be copied once
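Editorial note: the rough estimate discussed above can be reproduced with a short calculation. This is only a sketch using the figures quoted in the channel (a 9.23GB image, a push rate oscillating between 3MB/s and 5MB/s). It assumes decimal units, a constant rate, and a single copy of the image; the function name `transfer_minutes` is made up for illustration.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move size_gb gigabytes at rate_mb_per_s megabytes per second.

    Assumes decimal units (1 GB = 1000 MB) and a constant transfer rate.
    """
    return size_gb * 1000 / rate_mb_per_s / 60


IMAGE_SIZE_GB = 9.23  # image size quoted in the channel

for rate in (3.0, 5.0):  # observed push-rate range, in MB/s
    print(f"{rate:.0f} MB/s -> {transfer_minutes(IMAGE_SIZE_GB, rate):.0f} min")
```

At the slower rate this comes to roughly 51 minutes, which is in line with the "about an hour" estimate; at the faster rate it is closer to half an hour. If layers are transferred more than once, the real time would be longer.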
[10:58:17] <hashar>	 10:58:05 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error
[10:58:17] <hashar>	 :-(
[10:58:22] <Dreamy_Jazz>	 :(
[10:58:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[10:58:53] <hashar>	 DOCKER
[10:59:06] <wikibugs>	 (03PS1) 10Gmodena: alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782)
[10:59:08] <Dreamy_Jazz>	 I presume you are going to retry?
[10:59:23] <hashar>	 go ahead and backport your patch :]
[10:59:25] <wikibugs>	 (03CR) 10MVernon: "This looks plausible to me; when it comes to deployment, do we want to merge this on a depooled proxy first to check all is good, or are y" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney)
[10:59:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena)
[10:59:36] <hashar>	 I am going to brew a coffee and will resume the train sync once you are done
[10:59:45] <Dreamy_Jazz>	 Okay. Backporting now. Thanks
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100)
[11:00:44] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]]
[11:00:49] <stashbot>	 T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086
[11:00:49] <stashbot>	 T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087
[11:01:17] <wikibugs>	 (03PS2) 10Gmodena: alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782)
[11:04:11] <logmsgbot>	 !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table  # T412040-fix-dryrun-02
[11:04:15] <stashbot>	 T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040
[11:06:46] <wikibugs>	 (03PS3) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892)
[11:07:36] * hashar grabs a coffee
[11:10:23] <Dreamy_Jazz>	 k8s image build and push is taking longer than normal which is unexpected because my config patches did not affect i18n. I expect this is because the last push as part of the mwpresync failed?
[11:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:41] <Dreamy_Jazz>	 I wonder if the same speed restriction is being seen for this build?
[11:15:29] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[11:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:18:00] <wikibugs>	 (03CR) 10Marco Fossati: [C:03+1] Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: 10Kimberly Sarabia)
[11:18:19] <hashar>	 Dreamy_Jazz: oh yeah my bad sorry
[11:18:29] <hashar>	 I imagine scap might indeed attempt to push the images :/
[11:18:33] <wikibugs>	 (03CR) 10Btullis: postgresql-airflow-main: Increase pgbouncer pool size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu)
[11:18:42] <hashar>	 I am dumb I forgot :/
[11:19:01] <Dreamy_Jazz>	 Yeah the build-and-push-log last has an entry at 11:02
[11:19:18] <hashar>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-1h&to=now&timezone=utc&viewPanel=panel-8
[11:19:32] <hashar>	 so yeah sorry I have passed you the hot potato of pushing stuff
[11:19:34] <hashar>	 :-\
[11:19:35] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[11:19:39] <Dreamy_Jazz>	 Yeah, been watching that graph and seeing it do the same thing :D
[11:20:01] <hashar>	 and I could not manage to find out how to reach the logs for that `docker push`
[11:20:45] <Dreamy_Jazz>	 It kind of feels like the maximum speed is lower than previous attempts to push
[11:21:35] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli)
[11:22:57] <Dreamy_Jazz>	 https://grafana.wikimedia.org/goto/cGn-4EGDR?orgId=1 shows me that last week's presync went much faster (assuming that is what the activity at 04:30 is)
[11:22:58] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie
[11:29:56] <wikibugs>	 (03PS1) 10Elukey: admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737
[11:32:03] <elukey>	 the slow times may be related to pushing the image layers to swift, we should really start trying the ceph-based backend for /restricted
[11:32:04] <hashar>	 Dreamy_Jazz: it usually takes 45 minutes based on https://sal.toolforge.org/production?p=0&q=%22Finished+scap+sync-world%3A+testwikis%22&d=
[11:32:31] <elukey>	 but it will need more tests, so something not immediate :(
[11:33:11] <Dreamy_Jazz>	 Thanks for the context. I have time to wait and monitor this as it proceeds
[11:35:04] <logmsgbot>	 elukey@cumin1003 reimage (PID 1159643) is awaiting input
[11:39:43] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie
[11:40:38] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[11:41:53] <Dreamy_Jazz>	 Flurry of activity in `/var/lib/spiderpig/scap-image-build-and-push-log`
[11:42:22] <Dreamy_Jazz>	 The push-and-build completed successfully, it's now on to the sync-masters step
[11:43:07] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:43:07] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100)
[11:43:07] <jouncebot>	 In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300)
[11:43:36] <wikibugs>	 (03PS1) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[11:43:39] <Dreamy_Jazz>	 sync-master is going slower than normal, likely because it needs to copy more data, like an i18n backport
[11:44:31] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[11:44:44] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks Matthew.  I'm 99% sure it'll "Just Work Fine"TM.  But similarly if it's easy to depool a host and apply it there first I'd say let'" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney)
[11:46:28] <wikibugs>	 (03PS2) 10Ayounsi: [WIP] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549)
[11:50:52] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:50:58] <stashbot>	 T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086
[11:50:58] <stashbot>	 T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087
[11:51:00] <wikibugs>	 (03PS1) 10Gmodena: wdqs: register blazegraph with wikidata platform [alerts] - 10https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782)
[11:51:20] <wikibugs>	 (03CR) 10Ayounsi: "Tested in Netbox-next" [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[11:51:50] <wikibugs>	 (03PS3) 10Ayounsi: Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549)
[11:54:01] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[11:54:31] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[11:55:00] <Dreamy_Jazz>	 hashar: It seems my scap backport has also synced testwikis to wmf.7 based on https://versions.toolforge.org/?
[11:56:02] <Dreamy_Jazz>	 Yeah, https://test.wikipedia.org/wiki/Special:Version says wmf.7 on the debug servers and wmf.5 when not on the debug servers
[11:56:55] <Dreamy_Jazz>	 So I guess the train should be synced to the testwikis by this change and nothing else would be needed. I can ping you when I'm done if you want to check?
[11:59:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[12:03:40] <wikibugs>	 (03CR) 10A-pizzata: [C:03+1] Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey)
[12:04:01] <wikibugs>	 (03PS1) 10Btullis: Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017)
[12:05:29] <logmsgbot>	 marostegui@cumin1003 provision (PID 1202859) is awaiting input
[12:06:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796 (10MatthewVernon) 03NEW
[12:07:02] <wikibugs>	 (03PS1) 10Bunnypranav: core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791)
[12:07:33] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[12:08:23] <wikibugs>	 (03PS1) 10MVernon: admin: add fido-backed key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218744 (https://phabricator.wikimedia.org/T412796)
[12:08:39] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] (duration: 67m 55s)
[12:08:45] <stashbot>	 T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086
[12:08:45] <stashbot>	 T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087
[12:08:54] <Dreamy_Jazz>	 Proceeding with the train made an issue appear for one of the PSI team's tools, so I will want to backport again shortly :D
[12:09:39] <wikibugs>	 (03Merged) 10jenkins-bot: Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[12:10:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:10:59] <wikibugs>	 (03PS1) 10MVernon: admin: add fido-based ssh key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796)
[12:11:51] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine)
[12:11:52] <wikibugs>	 (03Abandoned) 10MVernon: admin: add fido-backed key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218744 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon)
[12:12:05] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine)
[12:12:44] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[12:12:53] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[12:14:17] <wikibugs>	 (03PS1) 10Dreamy Jazz: Follow-up: SI: Add "past checks" link next to accounts in table pager [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268)
[12:14:29] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:14:29] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 45 minute(s)
[12:14:29] <jouncebot>	 In 0 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300)
[12:14:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268) (owner: 10Dreamy Jazz)
[12:14:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav)
[12:15:59] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:17:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "@fceratto@wikimedia.org this is not yet submitted right?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto)
[12:18:15] <wikibugs>	 (03CR) 10Marostegui: "As agreed during the meeting, let's make this a separate cookbook for now, so we don't alter the existing one." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[12:19:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric [puppet] - 10https://gerrit.wikimedia.org/r/1217492 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto)
[12:19:50] <wikibugs>	 (03PS1) 10Btullis: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833)
[12:19:58] <wikibugs>	 (03PS2) 10Btullis: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833)
[12:19:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[12:22:08] <wikibugs>	 (03PS2) 10Aqu: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990)
[12:23:01] <wikibugs>	 (03CR) 10Aqu: "I've removed the duplicated declaration of the value of the number of instances (3)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu)
[12:24:17] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[12:25:58] <wikibugs>	 (03Merged) 10jenkins-bot: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[12:27:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:27:14] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up: SI: Add "past checks" link next to accounts in table pager [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268) (owner: 10Dreamy Jazz)
[12:27:47] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]]
[12:27:51] <stashbot>	 T411268: Suggested Investigations: Show link to checkuser log if target has been checked before - https://phabricator.wikimedia.org/T411268
[12:28:18] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[12:28:26] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[12:29:18] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] docker_registry: allow ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1218707 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[12:31:49] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:32:27] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[12:32:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463708 (10Marostegui) p:05Triage→03Medium @MatthewVernon I guess you'll handle this yourself? I can verify the ssh key out of band if you need help from clinic duty.
[12:34:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:34:43] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463714 (10MatthewVernon) @Marostegui I think @MoritzMuehlenhoff wanted to verify the new pubkey, so I'll tag him as reviewer on the CR.
[12:35:15] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[12:35:28] <wikibugs>	 (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski)
[12:35:42] <wikibugs>	 06SRE, 10decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11463729 (10MoritzMuehlenhoff)
[12:36:12] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463731 (10Marostegui) Sounds good! Let me know if you need any help from me as I am on clinic duty this week.
[12:38:35] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]] (duration: 10m 47s)
[12:38:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster1003.eqiad.wmnet
[12:38:39] <stashbot>	 T411268: Suggested Investigations: Show link to checkuser log if target has been checked before - https://phabricator.wikimedia.org/T411268
[12:40:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon)
[12:41:42] <wikibugs>	 (03CR) 10MVernon: [C:03+2] admin: add fido-based ssh key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon)
[12:43:01] <wikibugs>	 (03CR) 10Btullis: [C:03+2] postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu)
[12:45:00] <wikibugs>	 (03Merged) 10jenkins-bot: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu)
[12:45:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:48:32] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:11] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[12:51:17] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[12:51:33] <logmsgbot>	 jmm@cumin2002 decommission (PID 2783760) is awaiting input
[12:52:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:53:07] <wikibugs>	 06SRE: Migrate ipblocks from fetch_external_clouds_vendors_nets.py to HIDDENPARMA - https://phabricator.wikimedia.org/T412805 (10JMeybohm) 03NEW
[12:55:19] <logmsgbot>	 jmm@cumin2002 decommission (PID 2783760) is awaiting input
[12:55:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:55:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:55:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1003.eqiad.wmnet
[12:55:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463871 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster1003.eqiad.wmnet` - puppetmaster1003....
[12:57:11] <wikibugs>	 (03PS1) 10Muehlenhoff: remove puppetmaster1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1218754 (https://phabricator.wikimedia.org/T412800)
[12:58:13] <wikibugs>	 (03PS1) 10Clément Goubert: team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799)
[12:58:48] <wikibugs>	 (03CR) 10Urbanecm: "Thank you for the changes! I just have one last question about this, otherwise, this looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300)
[13:01:30] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11463897 (10JMeybohm) 05Open→03Resolved a:03JMeybohm With {T352245} resolved, this has now been completed.
[13:01:39] <godog>	 !log fix network configuration and reboot cloudcephosd1052 - T399180
[13:01:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:43] <stashbot>	 T399180: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180
[13:02:24] <wikibugs>	 (03PS2) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[13:03:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807 (10cmooney) 03NEW p:05Triage→03Medium
[13:03:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:03:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11463921 (10cmooney)
[13:05:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish)
[13:05:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish)
[13:06:23] <Emperor>	 !log disable puppet on O:swift::proxy
[13:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:05] <wikibugs>	 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops, and 3 others: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11463940 (10Nikerabbit) See also {T386371} which mentions that one pod uses more memory than others.
[13:07:40] <wikibugs>	 (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski)
[13:08:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:09:23] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Swift-proxy: set DSCP on outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney)
[13:09:52] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski)
[13:14:25] <Emperor>	 !log depool ms-fe1010 for testing
[13:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:20] <wikibugs>	 (03PS1) 10Sbisson: CX3 Build 1.0.0+20251215 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842)
[13:15:55] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11464044 (10ayounsi) a:05cmooney→03ayounsi
[13:15:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson)
[13:16:48] <wikibugs>	 (03PS1) 10Dreamy Jazz: Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173)
[13:17:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[13:17:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: 10Dreamy Jazz)
[13:19:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] remove puppetmaster1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1218754 (https://phabricator.wikimedia.org/T412800) (owner: 10Muehlenhoff)
[13:20:13] <wikibugs>	 (03PS3) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[13:20:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11464054 (10fgiunchedi) >>! In T399180#11432250, @cmooney wrote: >>>! In T399180#11432052, @fgiunchedi wrote: >> I think the easiest would be to: >>  >> * Remove the spuri...
[13:20:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11464055 (10MoritzMuehlenhoff)
[13:21:21] <wikibugs>	 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11464058 (10fgiunchedi) JFYI we can now proceed with cloudcephosd1052 too
[13:23:29] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464079 (10MoritzMuehlenhoff) 05Resolved→03Open The various certs still need to be cleaned out, reopening
[13:24:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:25:16] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey)
[13:29:45] <Emperor>	 !log repool ms-fe1010
[13:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:28] <Emperor>	 !log enable puppet on O:swift::proxy
[13:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:34:32] <wikibugs>	 (03PS4) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[13:34:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[13:34:47] <logmsgbot>	 jmm@cumin2002 upgrade-firmware (PID 2805516) is awaiting input
[13:35:54] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:36:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:41:05] <wikibugs>	 (03PS2) 10Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731
[13:43:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:43:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:44:15] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[13:48:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[13:49:59] <wikibugs>	 (03CR) 10Mszwarc: "Funny thing... This patch causes temp. accounts on GC to lose their background (but not outline): https://phabricator.wikimedia.org/F71089735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders)
[13:50:19] <wikibugs>	 06SRE, 06Data-Persistence: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11464220 (10MatthewVernon) 05Open→03Stalled a:03MatthewVernon
[13:50:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11464225 (10ayounsi) a:03Papaul @Papaul would you be ok to work with Nokia's support to figure out what those inbound errors mean ?  Thanks
[13:50:39] <wikibugs>	 06SRE, 06Data-Persistence: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11464228 (10MatthewVernon) Reassigning to myself to do the clearup of the software-key in due course.
[13:50:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2028.codfw.wmnet']
[13:51:49] <wikibugs>	 (03PS5) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[13:52:45] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Remove LoggedOut cookie logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza)
[13:53:00] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Remove LoggedOut cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1217774 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza)
[13:54:20] <hashar>	 Dreamy_Jazz: thanks for the update, sorry I went out for lunch! I'll check the train status
[13:55:35] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie
[13:56:15] <wikibugs>	 (03CR) 10Mszwarc: "This also happens in the current situation when you visit GC, but have no permissions to do IP Reveal – e.g., going to https://meta.wikime" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders)
[13:56:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet']
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1400)
[14:00:04] <jouncebot>	 Bunnypranav, hamishcz, stephanebisson, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86657 and previous config saved to /var/cache/conftool/dbconfig/20251216-140008-marostegui.json
[14:00:11] <stephanebisson>	 o/
[14:00:13] <hamishcz>	 i'm here :)
[14:00:14] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[14:00:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[14:00:52] <bunnypranav>	 o/
[14:02:00] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464299 (10JMeybohm) a:05JMeybohm→03MoritzMuehlenhoff Thanks for volunteering to remove the remaining certs and cergen config during your January cleanup
[14:02:04] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[14:02:08] <stephanebisson>	 bunnypranav can you deploy your change or do you need a deployer to do it?
[14:02:27] <bunnypranav>	 I will need a deployer
[14:03:14] <stephanebisson>	 bunnypranav are you able to test it during the deployment?
[14:03:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:03:33] <bunnypranav>	 stephanebisson yes I can test it
[14:03:51] <stephanebisson>	 bunnypranav ok, I'll deploy it for you
[14:04:07] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['es2028.codfw.wmnet']
[14:04:10] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:04:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav)
[14:04:17] <Dreamy_Jazz>	 \o
[14:04:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet']
[14:04:59] <wikibugs>	 (03Merged) 10jenkins-bot: core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav)
[14:05:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218290 (owner: 10Muehlenhoff)
[14:05:22] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:05:32] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]]
[14:05:36] <stashbot>	 T412791: jawiki: Add abusefilter-access-protected-vars to temporary-account-viewer - https://phabricator.wikimedia.org/T412791
[14:05:51] <bunnypranav>	 stephanebisson: Thanks for the help!
[14:06:35] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[14:06:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[14:07:46] <logmsgbot>	 !log sbisson@deploy2002 bunnypranav, sbisson: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:07:52] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2003.codfw.wmnet
[14:07:53] <bunnypranav>	 testing
[14:08:04] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:08:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:08:34] <bunnypranav>	 stephanebisson: All good, works as intended!
[14:08:44] <logmsgbot>	 !log sbisson@deploy2002 bunnypranav, sbisson: Continuing with sync
[14:09:04] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[14:10:01] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['es2028.codfw.wmnet']
[14:10:29] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[14:10:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2003.codfw.wmnet
[14:10:55] <wikibugs>	 (03CR) 10Mszwarc: "Reported as: https://phabricator.wikimedia.org/T412823" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders)
[14:11:33] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771
[14:11:47] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771 (owner: 10Dpogorzelski)
[14:11:49] <wikibugs>	 (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771 (owner: 10Dpogorzelski)
[14:11:52] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[14:12:47] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]] (duration: 07m 15s)
[14:12:51] <stashbot>	 T412791: jawiki: Add abusefilter-access-protected-vars to temporary-account-viewer - https://phabricator.wikimedia.org/T412791
[14:13:09] <stephanebisson>	 over to you hamishcz
[14:13:11] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[14:13:17] <hamishcz>	 :) 
[14:13:36] <hamishcz>	 waiting for testing
[14:13:37] <bunnypranav>	 stephanebisson: Thank you for the quick assistance!
[14:14:57] <stephanebisson>	 hamishcz are you deploying it yourself?
[14:15:09] <hamishcz>	 nah i can't do that
[14:15:12] <hamishcz>	 need your help
[14:15:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P86658 and previous config saved to /var/cache/conftool/dbconfig/20251216-141517-marostegui.json
[14:15:26] <stephanebisson>	 hamishcz ok, I'll help you
[14:15:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464375 (10MoritzMuehlenhoff) JFTR, I upgraded firmware and IDRAC in the mean time to the latest releases.
[14:15:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish)
[14:16:38] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: enable protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish)
[14:17:08] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]]
[14:17:12] <stashbot>	 T412710: Enable protection indicators for zhwiki - https://phabricator.wikimedia.org/T412710
[14:19:27] <logmsgbot>	 !log sbisson@deploy2002 sbisson, hamishz: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:19:38] <stephanebisson>	 hamishcz ^
[14:20:27] <hamishcz>	 tested and works as intended
[14:21:17] <logmsgbot>	 !log sbisson@deploy2002 sbisson, hamishz: Continuing with sync
[14:24:24] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[14:24:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[14:24:53] <logmsgbot>	 elukey@cumin1003 reimage (PID 1324133) is awaiting input
[14:24:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464421 (10Papaul) @ayounsi what else needs to be done here?
[14:25:13] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]] (duration: 08m 05s)
[14:25:16] <stashbot>	 T412710: Enable protection indicators for zhwiki - https://phabricator.wikimedia.org/T412710
[14:25:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish)
[14:25:42] <stephanebisson>	 hamishcz ^ your other patch
[14:26:06] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[14:26:08] <wikibugs>	 (03PS3) 10Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731
[14:26:39] <wikibugs>	 (03Merged) 10jenkins-bot: svwiki: lift autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish)
[14:27:02] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:27:12] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218302|svwiki: lift autoconfirmed setting (T412713)]]
[14:27:15] <stashbot>	 T412713: Set $wgAutoConfirmCount to 10 for sv.wikipedia - https://phabricator.wikimedia.org/T412713
[14:27:37] <wikibugs>	 (03PS2) 10Elukey: admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737
[14:28:46] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737 (owner: 10Elukey)
[14:28:49] <hamishcz>	 this one is not active yet?
[14:29:12] <moritzm>	 !log installing glibc security updates
[14:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:24] <stephanebisson>	 hamishcz soon
[14:29:30] <logmsgbot>	 !log sbisson@deploy2002 sbisson, hamishz: Backport for [[gerrit:1218302|svwiki: lift autoconfirmed setting (T412713)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:30:24] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#11464449 (10Arendpieter)
[14:30:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P86659 and previous config saved to /var/cache/conftool/dbconfig/20251216-143025-marostegui.json
[14:31:31] <stephanebisson>	 hamishcz you can test now
[14:31:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11464460 (10Eevans) 05Open→03Resolved
[14:32:08] <hamishcz>	 gimme a sec
[14:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:32:38] <wikibugs>	 (03PS12) 10Daniel Kinzler: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605
[14:32:39] <hamishcz>	 ah yes good to continue
[14:32:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86660 and previous config saved to /var/cache/conftool/dbconfig/20251216-143244-marostegui.json
[14:32:50] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[14:32:50] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[14:33:01] <logmsgbot>	 !log sbisson@deploy2002 sbisson, hamishz: Continuing with sync
[14:33:52] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie
[14:36:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464499 (10ayounsi) I was working on that as we speak.  As sretest2003 was reclaimed to test hosts I was able to run some more tests.  Running the still not m...
[14:37:01] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218302|svwiki: lift autoconfirmed setting (T412713)]] (duration: 09m 49s)
[14:37:05] <stashbot>	 T412713: Set $wgAutoConfirmCount to 10 for sv.wikipedia - https://phabricator.wikimedia.org/T412713
[14:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:37:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson)
[14:37:54] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:04-1] "Tested on Pontoon: the config file does not pass validation due to the trailing “:” highlighted." [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena)
[14:39:23] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251215 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson)
[14:39:56] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]]
[14:40:02] <stashbot>	 T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842
[14:40:02] <stashbot>	 T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779
[14:40:40] <wikibugs>	 (03PS6) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379)
[14:40:54] <wikibugs>	 (03PS3) 10Gmodena: alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782)
[14:41:40] <wikibugs>	 (03CR) 10Gmodena: alertmanager: onboard wikidata platform. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena)
[14:41:55] <wikibugs>	 (03PS1) 10Gmodena: wdqs: register blazegraph with wikidata platform [alerts] - 10https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782)
[14:42:15] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:42:25] <wikibugs>	 (03PS1) 10LorenMora: [Legal Footer] Deploy Legal Footer for Phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455)
[14:43:23] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Continuing with sync
[14:44:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464567 (10cmooney) It seems the interface can be set through the [[ https://www.debian.org/releases/trixie/example-preseed.txt | preseed ]] file...
[14:45:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86661 and previous config saved to /var/cache/conftool/dbconfig/20251216-144533-marostegui.json
[14:45:39] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[14:45:40] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[14:45:51] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance
[14:46:09] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena)
[14:46:21] <hamishcz>	 stephanebisson: thanks!
[14:47:24] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]] (duration: 07m 27s)
[14:47:29] <stashbot>	 T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842
[14:47:29] <stashbot>	 T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779
[14:47:49] <stephanebisson>	 over to you Dreamy_Jazz
[14:47:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P86662 and previous config saved to /var/cache/conftool/dbconfig/20251216-144752-marostegui.json
[14:47:57] <Dreamy_Jazz>	 Thanks
[14:48:08] <wikibugs>	 (03PS1) 10Jsn.sherman: [Moderator tools] Add data-mw-interface in addition to data-mw="interface" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187)
[14:48:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: 10Dreamy Jazz)
[14:48:51] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] P::cache::haproxy: enable QOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[14:49:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: 10Jsn.sherman)
[14:49:23] <wikibugs>	 (03Merged) 10jenkins-bot: Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: 10Dreamy Jazz)
[14:49:52] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218763|Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]]
[14:49:57] <stashbot>	 T361173: Add schema migration config for cu_useragent table - https://phabricator.wikimedia.org/T361173
[14:52:07] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218763|Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:52:48] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[14:56:48] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218763|Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]] (duration: 06m 55s)
[14:56:52] <stashbot>	 T361173: Add schema migration config for cu_useragent table - https://phabricator.wikimedia.org/T361173
[14:57:12] <Dreamy_Jazz>	 !log Afternoon UTC backport window done
[14:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:45] <wikibugs>	 (03PS1) 10Btullis: Update spark/hadoop mountpoints and environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833)
[14:59:27] <wikibugs>	 (03PS2) 10Btullis: Update spark/hadoop mountpoints and environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833)
[15:00:05] <jouncebot>	 Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1500)
[15:02:36] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update spark/hadoop mountpoints and environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[15:03:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P86663 and previous config saved to /var/cache/conftool/dbconfig/20251216-150301-marostegui.json
[15:04:15] <wikibugs>	 (03Merged) 10jenkins-bot: Update spark/hadoop mountpoints and environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[15:05:57] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:05:57] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1500)
[15:05:57] <jouncebot>	 In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1530)
[15:06:03] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[15:06:13] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[15:06:56] <Dreamy_Jazz>	 Anyone using scap in this window? Want to deploy a private code change
[15:08:30] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239)
[15:10:03] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:09] <wikibugs>	 (03PS2) 10Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239)
[15:10:12] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: support REST sandbox requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz)
[15:13:15] <wikibugs>	 (03PS3) 10Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239)
[15:13:37] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: end frwiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[15:15:06] <kostajh>	 I'm deploying a config patch
[15:15:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[15:16:02] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: end frwiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[15:16:34] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]]
[15:16:38] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[15:18:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86664 and previous config saved to /var/cache/conftool/dbconfig/20251216-151809-marostegui.json
[15:18:15] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[15:18:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[15:18:26] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[15:18:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86665 and previous config saved to /var/cache/conftool/dbconfig/20251216-151834-marostegui.json
[15:18:54] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:22:37] <wikibugs>	 (03PS6) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[15:22:50] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11464876 (10ABran-WMF) I've read through the backlog of this task and followed {T411895} to try and figure out how I could move mailman's web interface behi...
[15:23:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11464880 (10Jclark-ctr) a:05Eevans→03Jclark-ctr
[15:26:00] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[15:26:23] <gehel>	 !log cleanup temp files on archiva1002
[15:26:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:25] <wikibugs>	 (03PS1) 10Bking: bking: Add FIDO-backed SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1218782
[15:30:00] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]] (duration: 13m 26s)
[15:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1530)
[15:30:05] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[15:31:28] <wikibugs>	 (03CR) 10Jelto: unlink wikipedia25.org from ncredir, point to k8s-ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[15:33:35] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me. I also asked the user to supply the key for checking via Slack, for an out-of-band identity check." [puppet] - 10https://gerrit.wikimedia.org/r/1218782 (owner: 10Bking)
[15:35:03] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[15:41:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464928 (10Jhancock.wm) @cmooney  There is nothing plugged into any of the ports on this server except the expected. idrac and the first 1G port....
[15:41:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1218782 (owner: 10Bking)
[15:41:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:42:11] <wikibugs>	 (03CR) 10Bking: [C:03+2] bking: Add FIDO-backed SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1218782 (owner: 10Bking)
[15:42:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464931 (10cmooney) >>! In T412807#11464928, @Jhancock.wm wrote: > @cmooney  There is nothing plugged into any of the ports on this server except...
[15:45:02] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:45:20] <wikibugs>	 (03PS1) 10Cathal Mooney: DNS discovery: split responses to magru servers based on rack [puppet] - 10https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617)
[15:45:52] <wikibugs>	 (03PS1) 10Elukey: setup.py: avoid Sphinx >= 9.x [software/homer] - 10https://gerrit.wikimedia.org/r/1218785
[15:46:06] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:47:24] <wikibugs>	 (03PS7) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[15:47:37] <hashar>	 !log Restarting CI Jenkins
[15:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:58] <wikibugs>	 (03PS8) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549)
[15:51:07] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster2002 - https://phabricator.wikimedia.org/T412783#11464950 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:53:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11464981 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[15:55:57] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1600).
[16:01:51] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy
[16:02:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:02:33] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy
[16:03:03] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@3a23687]: deploy phab2002 for T412825
[16:03:07] <stashbot>	 T412825: Deploy Phab/Phorge 2025-12-16 - https://phabricator.wikimedia.org/T412825
[16:03:34] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@3a23687]: deploy phab2002 for T412825 (duration: 00m 31s)
[16:03:50] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@3a23687]: deploy phab1004 for T412825
[16:04:48] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@3a23687]: deploy phab1004 for T412825 (duration: 00m 58s)
[16:05:57] <wikibugs>	 (03PS1) 10Fabfur: hiera: remove set-tos experiment [puppet] - 10https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785)
[16:05:57] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:07:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:08:07] <wikibugs>	 (03PS2) 10Fabfur: hiera: remove set-tos experiment [puppet] - 10https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785)
[16:10:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:14:48] <wikibugs>	 (CR) BCornwall: [C:+1] wikimediafoundation.org: Add AAAA for non-apex records as well [dns] - https://gerrit.wikimedia.org/r/1217582 (https://phabricator.wikimedia.org/T403269) (owner: Majavah)
[16:15:47] <wikibugs>	 (CR) BCornwall: [C:+2] wikimediafoundation.org: Add AAAA for non-apex records as well [dns] - https://gerrit.wikimedia.org/r/1217582 (https://phabricator.wikimedia.org/T403269) (owner: Majavah)
[16:15:59] <logmsgbot>	 !log brett@dns1006 START - running authdns-update
[16:17:35] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: remove set-tos experiment [puppet] - https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785) (owner: Fabfur)
[16:18:15] <logmsgbot>	 !log brett@dns1006 END - running authdns-update
[16:18:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11465060 (RKemper) >>! In T411919#11454698, @Jclark-ctr wrote: > @RKemper  I am usually here most mornings early. what day would work best for you next week to down time is...
[16:20:55] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11465064 (MoritzMuehlenhoff)
[16:25:09] <wikibugs>	 (CR) Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress (3 comments) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:25:54] <wikibugs>	 (CR) Daniel Kinzler: rest gateway: add smoke tests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1215605 (owner: Daniel Kinzler)
[16:28:16] <wikibugs>	 SRE, LDAP-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Access Admin menu in Airflow - https://phabricator.wikimedia.org/T412119#11465084 (APizzata-WMF) Thanks @BTullis, I can now see the menu!
[16:28:52] <wikibugs>	 (CR) Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:30:05] <wikibugs>	 (CR) Jelto: unlink wikipedia25.org from ncredir, point to k8s-ingress (1 comment) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:31:54] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:32:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86667 and previous config saved to /var/cache/conftool/dbconfig/20251216-163252-marostegui.json
[16:32:58] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:32:58] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:40:53] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:41:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: Gergő Tisza)
[16:42:43] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:43:07] <wikibugs>	 (CR) Jelto: miscweb: add wikipedia25.org to extra SANs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:45:01] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 13 Jan 2026 04:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[16:45:18] <wikibugs>	 (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816)
[16:47:00] <moritzm>	 !log installing unbound security updates
[16:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P86668 and previous config saved to /var/cache/conftool/dbconfig/20251216-164800-marostegui.json
[16:48:25] <wikibugs>	 (CR) STran: [C:+2] "self-merging, as ipoid is actively broken" [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816) (owner: STran)
[16:50:33] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816) (owner: STran)
[16:51:35] <wikibugs>	 (PS1) Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255)
[16:52:32] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[16:52:59] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[16:53:31] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[16:53:44] <wikibugs>	 (CR) Elukey: [C:+2] setup.py: avoid Sphinx >= 9.x [software/homer] - https://gerrit.wikimedia.org/r/1218785 (owner: Elukey)
[16:54:01] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[16:54:24] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[16:54:46] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[17:00:05] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1700).
[17:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:30] <tgr_>	 o/
[17:00:54] <rzl>	 tgr: o/ this looks reasonable to me but because it's a VCL change I'd like to get the traffic team to deploy it
[17:01:01] <rzl>	 er, tgr_: sorry
[17:01:16] <logmsgbot>	 !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwikibooks --logwiki=metawiki Magiuser 'Renamed user f3a49d320a6984a0d6b403d313476916'  # T412784
[17:01:20] <stashbot>	 T412784: Unblock stuck global rename of Renamed user f3a49d320a6984a0d6b403d313476916 - https://phabricator.wikimedia.org/T412784
[17:01:36] <tgr_>	 sure
[17:01:38] <rzl>	 will you want to be around to test that live when it goes out? or does it just need a deployer, and we can ship it whenever?
[17:01:57] <tgr_>	 the cookie has not been emitted for years, it seems
[17:02:09] <tgr_>	 and no one seems to be sure what it did in the past
[17:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:02:18] <rzl>	 haha got it
[17:02:21] <rzl>	 so, nothing to test, I take it :)
[17:02:21] <tgr_>	 so I wouldn't have any idea what to test
[17:02:24] <rzl>	 cool
[17:02:50] <tgr_>	 thx!
[17:03:00] <rzl>	 in that case let me follow up and get it handled async -- sorry for the extra delay, but you can consider it taken care of
[17:03:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P86669 and previous config saved to /var/cache/conftool/dbconfig/20251216-170308-marostegui.json
[17:03:12] <rzl>	 if you don't hear anything and it doesn't get done, feel free to follow up with me or with traffic
[17:03:22] <tgr_>	 no worries, it's just cleanup in any case, not time sensitive at all
[17:03:25] <rzl>	 👍
[17:07:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:13:04] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11465357 (Dzahn) In the change merged back in 2024: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072247/9/hieradata/common/profile/trafficserver/...
[17:14:12] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[17:14:22] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465381 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie
[17:14:23] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028
[17:14:47] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[17:18:05] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host es2028 - cmooney@cumin1003"
[17:18:09] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host es2028 - cmooney@cumin1003"
[17:18:09] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:09] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache es2028.codfw.wmnet 140.0.192.10.in-addr.arpa 0.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:18:12] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es2028.codfw.wmnet 140.0.192.10.in-addr.arpa 0.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:18:13] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2028
[17:18:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86670 and previous config saved to /var/cache/conftool/dbconfig/20251216-171816-marostegui.json
[17:18:24] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[17:18:24] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[17:18:30] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2028
[17:18:30] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028
[17:18:33] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:18:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86671 and previous config saved to /var/cache/conftool/dbconfig/20251216-171841-marostegui.json
[17:20:18] <wikibugs>	 (PS1) RLazarus: kubernetes: Set default Envoy version to 1.35.7 [puppet] - https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975)
[17:20:37] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11465434 (Dzahn) You can remove the "Prepare tcpproxy VMs for accepting traffic on the new public IPs" and general tcpproxy part from the list above. That...
[17:23:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11465441 (Jclark-ctr) @rkemper I do not have access to run down time
[17:24:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:25:53] <wikibugs>	 (CR) Clément Goubert: [C:+1] kubernetes: Set default Envoy version to 1.35.7 [puppet] - https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[17:29:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:30:12] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11465461 (Dzahn) >>! In T408592#11452756, @ATitkov wrote: > If anything is still not clear, please ask  Hi @ATitkov   thanks for the answer...
[17:32:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:35:20] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465495 (cmooney) >>! In T412807#11464931, @cmooney wrote: > Anyway that could also be the culprit, I'll kick off another reimage and see if it...
[17:37:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:39:14] <wikibugs>	 (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218801
[17:39:32] <claime>	 Hmm what's going on with the errors? Is someone checking?
[17:39:48] <wikibugs>	 (CR) Ahmon Dancy: [C:+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218801 (owner: Ahmon Dancy)
[17:40:32] <dancy>	 claime: I'm noticing a lot of "Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded" errors today.
[17:40:41] <wikibugs>	 (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218801 (owner: Ahmon Dancy)
[17:40:42] <claime>	 dancy: looks like circuit breaking
[17:40:51] <claime>	 (looking at the spike in logstash)
[17:42:19] <claime>	 dancy: https://grafana.wikimedia.org/goto/4lRK9PMvR?orgId=1 uhhh
[17:42:29] <claime>	 that's a lot of w-o-w increase in connections
[17:44:26] <claime>	 I don't have time to debug this unfortunately :/ It's already almost 7PM
[17:45:01] <dancy>	 Looking back on that graph over 30 days, there seems to be a steady upward trajectory for the codfw connections.
[17:45:46] <dancy>	 With a big spike around today.
[17:45:48] <claime>	 dancy: yes but even just looking at the last 2 days, we have 3x'd the max rps in codfw
[17:46:06] <claime>	 2.91k last week, 7.7k this week
[17:46:50] <claime>	 Started during the night of the 11th
[17:51:52] <wikibugs>	 (PS7) Daniel Kinzler: rest gateway: split anon class [deployment-charts] - https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379)
[17:53:49] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465617 (cmooney) I see these lines in `/var/log/syslog` in the busybox shell: ` Dec 16 17:31:55 netcfg[1167]: INFO: Activating interface eno1n...
[17:54:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:59:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:59:45] <tappof>	 !log Cleaned up old files (not deleted by logrotate) on centrallog1002; removed the rsyslog-debug file on centrallog1002.
[17:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1800)
[18:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:07:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:17:28] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465779 (elukey) @cmooney I am +1 on testing something like `d-i netcfg/link_wait_timeout string 10`, it seems an easy one to see if anything c...
[18:18:41] <wikibugs>	 (CR) Eric Gardner: [C:+2] Decommission Article Summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: Kimberly Sarabia)
[18:20:36] <wikibugs>	 (Merged) jenkins-bot: Decommission Article Summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: Kimberly Sarabia)
[18:20:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/3/3 (Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[18:21:06] <wikibugs>	 (PS1) CDanis: Revert^2 "zramswap: notify service on config change" [puppet] - https://gerrit.wikimedia.org/r/1218805
[18:21:06] <volans>	 !incidents
[18:21:06] <sirenbot>	 7195 (UNACKED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[18:21:14] <volans>	 !ack 7195
[18:21:14] <sirenbot>	 7195 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[18:23:16] <wikibugs>	 (PS2) Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255)
[18:23:16] <wikibugs>	 (PS1) Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255)
[18:24:11] <wikibugs>	 (CR) CI reject: [V:-1] Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[18:25:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/3/3 (Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[18:27:02] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:27:42] <wikibugs>	 (PS1) Majavah: P:mail::smarthost: Include Exim queue Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/1218807
[18:27:43] <wikibugs>	 (PS1) Majavah: P:mail::smarthost: Remove NRPE monitoring [puppet] - https://gerrit.wikimedia.org/r/1218808
[18:30:55] <wikibugs>	 (CR) Bking: [C:+2] alertmanager: onboard wikidata platform. [puppet] - https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: Gmodena)
[18:32:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86672 and previous config saved to /var/cache/conftool/dbconfig/20251216-183208-marostegui.json
[18:32:15] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[18:32:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[18:33:36] <wikibugs>	 (CR) Majavah: [C:+2] P:mail::smarthost: Include Exim queue Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/1218807 (owner: Majavah)
[18:33:47] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:17] <wikibugs>	 (CR) Bking: [C:+1] wdqs: register blazegraph with wikidata platform [alerts] - https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782) (owner: Gmodena)
[18:35:02] <wikibugs>	 (PS2) Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255)
[18:35:03] <wikibugs>	 (PS3) Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255)
[18:35:54] <wikibugs>	 (CR) CI reject: [V:-1] Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[18:37:39] <wikibugs>	 (CR) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[18:38:46] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie
[18:38:59] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465914 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu...
[18:47:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P86673 and previous config saved to /var/cache/conftool/dbconfig/20251216-184717-marostegui.json
[18:47:41] <wikibugs>	 (PS3) Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255)
[19:00:04] <jouncebot>	 dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1900)
[19:02:11] <dancy>	 o/
[19:02:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P86674 and previous config saved to /var/cache/conftool/dbconfig/20251216-190225-marostegui.json
[19:04:28] <wikibugs>	 (PS1) Dzahn: admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592)
[19:04:37] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:04:52] <wikibugs>	 (PS2) Dzahn: admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592)
[19:05:36] <wikibugs>	 (CR) Dzahn: [C:+2] miscweb: add wikipedia25.org to extra SANs (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:05:45] <wikibugs>	 (CR) Dzahn: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1218813" [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:06:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 343554512 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:08:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42704 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:10:06] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466031 (ATitkov) > Would it be ok with you if we do that next week, on December 22nd?   Yes, I think also Friday 19 Dec is possible, sinc...
[19:11:35] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277)
[19:11:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by dancy@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[19:12:25] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[19:13:39] <wikibugs>	 (CR) Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:14:20] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466051 (ATitkov) In regards to the request that the site should be published at 8:30 UTC on Jan 15th 2026, I am wondering if we can use a...
[19:16:43] <wikibugs>	 (PS3) Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592)
[19:17:27] <wikibugs>	 (CR) Dzahn: "rebased and answered inline question" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:17:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86675 and previous config saved to /var/cache/conftool/dbconfig/20251216-191733-marostegui.json
[19:17:40] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[19:17:40] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[19:17:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[19:17:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86676 and previous config saved to /var/cache/conftool/dbconfig/20251216-191759-marostegui.json
[19:18:43] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.7  refs T408277
[19:18:47] <stashbot>	 T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277
[19:19:35] <wikibugs>	 (CR) Gergő Tisza: [C:+1] "Tested the command hint and the dashboard link with a recent task and they both work as expected." [alerts] - https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: Clément Goubert)
[19:23:18] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1148.eqiad.wmnet with reason: T411919
[19:23:22] <stashbot>	 T411919: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919
[19:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11466095 (RKemper) >>! In T411919#11465441, @Jclark-ctr wrote: > @rkemper I do not have access to run down time   Ah, didn't realize. Okay, I put a downtime on `an-worker114...
[19:25:11] <wikibugs>	 (PS1) Milimetric: trafficserver: Send /evt-502b/v2/events to intake-analytics [puppet] - https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863)
[19:25:36] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance
[19:25:55] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance
[19:26:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86678 and previous config saved to /var/cache/conftool/dbconfig/20251216-192603-marostegui.json
[19:26:09] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[19:26:09] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[19:26:48] <wikibugs>	 (PS11) Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592)
[19:29:05] <wikibugs>	 (CR) Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress (4 comments) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:31:53] <wikibugs>	 (PS1) Ahmon Dancy: Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218818
[19:33:05] <wikibugs>	 (CR) Ahmon Dancy: [C:+2] Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218818 (owner: Ahmon Dancy)
[19:33:57] <wikibugs>	 (Merged) jenkins-bot: Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/1218818 (owner: Ahmon Dancy)
[19:41:54] <wikibugs>	 (CR) Dzahn: [C:+2] admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:44:11] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:48:30] <wikibugs>	 (PS12) Dzahn: unlink wikipedia25.org from ncredir, point to geoip text-addrs [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592)
[19:48:38] <wikibugs>	 (CR) Dzahn: unlink wikipedia25.org from ncredir, point to geoip text-addrs (1 comment) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:49:24] <wikibugs>	 (Merged) jenkins-bot: admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[19:55:15] <wikibugs>	 (CR) Dzahn: "realizing now this is just like a cleanup that can happen any time later.. on or after Jan 15 - the DNS change is the only thing that matt" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[20:02:06] <wikibugs>	 (PS1) Herron: arclamp: reduce compress days [puppet] - https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842)
[20:02:34] <wikibugs>	 (CR) CI reject: [V:-1] arclamp: reduce compress days [puppet] - https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: Herron)
[20:03:14] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466178 (Dzahn) >>! In T408592#11466031, @ATitkov wrote: >> Would it be ok with you if we do that next week, on December 22nd?  >  > Yes,...
[20:03:19] <wikibugs>	 (PS2) Herron: arclamp: reduce compress days [puppet] - https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842)
[20:05:52] <wikibugs>	 (CR) Dzahn: "It might be considered nicer to just change the 2 relevant lines in an existing zone file.. but since this is currently on ncredir.. it is" [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[20:08:37] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466187 (Dzahn) Once we have moved the repo to the new location, and with the config for CI to build the docker images that Jelto has alre...
[20:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:10:40] <wikibugs>	 (CR) Dzahn: [C:+1] arclamp: reduce compress days [puppet] - https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: Herron)
[20:12:02] <wikibugs>	 (CR) Herron: [C:+2] "Thanks @dzahn@wikimedia.org!" [puppet] - https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: Herron)
[20:16:14] <logmsgbot>	 !log dzahn@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:17:04] <logmsgbot>	 !log dzahn@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:17:24] <mutante>	 I was going to deploy something to admin_ng on k8s but I said NO to the diff.
[20:17:34] <mutante>	 reason: unrelated changes in my diff. undeployed.
[20:18:12] <mutante>	 thinking about fully reverting mine or leaving it as it is
[20:18:28] <mutante>	 afaict there are even 2 different undeployed but merged changes
[20:18:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 537000344 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:19:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2164432 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:31:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86680 and previous config saved to /var/cache/conftool/dbconfig/20251216-203153-marostegui.json
[20:31:59] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[20:32:00] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[20:33:24] <wikibugs>	 (PS1) Eric Gardner: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857)
[20:36:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[20:39:26] <wikibugs>	 (PS1) Kosta Harlan: product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866)
[20:40:00] <wikibugs>	 (CR) Clare Ming: [C:+1] product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[20:40:13] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[20:40:19] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4073.50 ms
[20:40:41] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 33%, RTA = 620.39 ms
[20:46:55] <wikibugs>	 SRE, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11466380 (AKanji-WMF)
[20:47:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P86681 and previous config saved to /var/cache/conftool/dbconfig/20251216-204701-marostegui.json
[20:58:09] <wikibugs>	 (CR) Phuedx: [C:+1] "Confirming that this should work. `mediawiki.database AIUI `$wgDBname` is e" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T2100).
[21:00:05] <jouncebot>	 JSherman, tgr, EricGardner, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:15] <JSherman>	 \o
[21:00:17] <tgr_>	 o/
[21:00:24] <wikibugs>	 (CR) Phuedx: [C:+1] "Sorry." [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[21:00:38] <tgr_>	 my patch is a noop, feel free to bundle it with something else
[21:00:40] <kostajh>	 i'm here
[21:00:44] <kostajh>	 same for mine
[21:01:20] <JSherman>	 ah, very good
[21:01:56] <JSherman>	 Those are also config, so they should go relatively fast
[21:02:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P86682 and previous config saved to /var/cache/conftool/dbconfig/20251216-210210-marostegui.json
[21:02:16] <EricGardner>	 I can wait until other patches are done to deploy mine (which is just an instrumentation change)
[21:02:29] <JSherman>	 I'm happy to deploy if we don't have another deployer on hand
[21:02:51] <JSherman>	 EricGardner: mine is really low risk, maybe we could bundle ours together to save time?
[21:02:56] <EricGardner>	 Sure, sounds good
[21:03:21] <JSherman>	 okay, I'll start with tgr_: and kostajh: together
[21:03:32] <kostajh>	 thanks
[21:04:51] <JSherman>	 kostajh: it didn't want to let me bundle yours; I'll do tgr_ and then you
[21:06:07] <JSherman>	 oh, actually it was yours, tgr_: 
[21:06:07] <JSherman>	 > Error for Change '1217790', project: 'operations/mediawiki-config', branch: 'master':
[21:06:07] <JSherman>	 Change '1217790' has dependency '1203252' targeting the master branch
[21:06:07] <JSherman>	 of MediaWiki code project 'mediawiki/core', but the dependency is not
[21:06:07] <JSherman>	 present in live train branch: wmf/1.46.0-wmf.5
[21:06:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[21:07:29] <wikibugs>	 (Merged) jenkins-bot: product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: Kosta Harlan)
[21:07:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 42927688 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:08:02] <logmsgbot>	 !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]]
[21:08:06] <stashbot>	 T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866
[21:08:12] <tgr_>	 hm I suppose scap is correct on that
[21:08:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3253376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:08:41] <tgr_>	 it's not really dependent on the other patch, I just wanted to link to it
[21:08:52] <tgr_>	 in any case, I can just wait until Thursday
[21:09:31] <JSherman>	 kostajh: is there any testing to do for yours, or just move on if it deploys happily?
[21:09:55] <JSherman>	 tgr_: ack
[21:10:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: Gergő Tisza)
[21:11:34] <kostajh>	 JSherman: you can just sync it
[21:12:05] <JSherman>	 kostajh: ack
[21:13:06] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1218830
[21:13:39] <wikibugs>	 (CR) BBlack: [C:+1] "I would just go ahead and remove all 3 in one patch really, but perhaps check turnilo to see if we have any recent samples of matching tra" [puppet] - https://gerrit.wikimedia.org/r/1215329 (owner: Dzahn)
[21:13:57] <wikibugs>	 (PS4) C. Scott Ananian: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:13:57] <wikibugs>	 (PS4) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:15:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 232649800 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:16:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:17:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86683 and previous config saved to /var/cache/conftool/dbconfig/20251216-211718-marostegui.json
[21:17:25] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[21:17:25] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[21:17:35] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[21:17:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86684 and previous config saved to /var/cache/conftool/dbconfig/20251216-211743-marostegui.json
[21:19:33] <wikibugs>	 (PS5) C. Scott Ananian: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:19:33] <wikibugs>	 (PS5) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:27:07] <wikibugs>	 (CR) C. Scott Ananian: [C:+1] Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:30:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1280712320 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:31:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3731512 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:33:19] <JSherman>	 we're waiting still on the "building container images" step; no errors at this time.
[21:33:59] <wikibugs>	 (PS6) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:34:07] <wikibugs>	 (CR) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:35:02] <wikibugs>	 (CR) C. Scott Ananian: [C:+1] Enable post-processing cache for all Parsoid-rendered wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:38:28] <wikibugs>	 (PS1) Bking: opensearch-cluster: Replace reload certificates API call with hot reload setting [deployment-charts] - https://gerrit.wikimedia.org/r/1218834 (https://phabricator.wikimedia.org/T412447)
[21:39:25] <dancy>	 JSherman: It will take a long time due to the localisation rebuild.
[21:40:04] <JSherman>	 dancy: ack
[21:40:21] <dancy>	 And pushing the image to the registry might be sketchy (T412265).  Fingers crossed!
[21:40:22] <stashbot>	 T412265: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265
[21:41:27] <JSherman>	 yeah, I saw that happen in a window last week; hoping we're just taking our time on that i18n cache build!
[21:43:58] <EricGardner>	 brb
[21:45:42] <JSherman>	 EricGardner: ack; we might be able to overrun into the next window as it's noted as often skipped. I won't be able to stay for that whole window though.
[21:45:52] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466529 (Dzahn)
[21:46:58] <EricGardner>	 I can stay for that window too
[21:47:26] <EricGardner>	 (theoretically that window belongs to my team and the new reader experiences team anyway, since web team is no more)
[21:48:03] <JSherman>	 I didn't know who inherited it!
[21:48:26] <EricGardner>	 Yeah I suppose we should update that on the deployments page at some point
[21:58:21] <Amir1>	 !log mwscript-k8s --follow -- findBadBlobs.php --wiki elwiki --mark "Corrupted UTF-8 (T351953)" --revisions 26381,30551 (T351953)
[21:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:25] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T2200)
[22:02:55] <logmsgbot>	 !log jsn@deploy2002 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 54m 53s)
[22:04:09] <JSherman>	 Well, that failed
[22:06:03] <mutante>	 JSherman: if it failed when trying to upload the docker image, possibly https://phabricator.wikimedia.org/T412265
[22:07:19] <JSherman>	 Yeah, I guess the question is, what to do now; revert?
[22:07:39] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11466610 (Jhancock.wm) time delayed reply.  @cmooney BIOS > integrated devices > (pick appropriate interface) > NIC configuration > Legacy Boot...
[22:10:03] <mutante>	 JSherman: don't have a good answer but maybe "try it one more time" and if you can repeat it.. THEN revert
[22:12:42] <JSherman>	 I'll give it a shot
[22:13:41] <logmsgbot>	 !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]]
[22:13:45] <stashbot>	 T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866
[22:19:55] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:20:39] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[22:22:21] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:29] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[22:28:46] <urbanecm>	 JSherman: would you also mind noting on the task what happened? even if it passes on the second try, it makes sense to have it recorded.
[22:29:11] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:29:13] <JSherman>	 urbanecm: ack
[22:29:18] <urbanecm>	 ty
[22:29:37] <wikibugs>	 (CR) Urbanecm: [C:+1] "LGTM, thanks for your work on this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:33:43] <wikibugs>	 SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11466660 (jsn.sherman) This happened again in the UTC late backport window: https://sal.toolforge.org/log/5jwwKZsBvg159pQrFeSI  https://spiderpig.wikimedia.org/j...
[22:40:49] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[22:41:09] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[22:41:42] <wikibugs>	 (PS1) Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820)
[22:42:35] <wikibugs>	 (CR) CI reject: [V:-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: Robertsky)
[22:46:47] <wikibugs>	 (PS8) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169)
[22:47:35] <wikibugs>	 (CR) CI reject: [V:-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:47:41] <wikibugs>	 (CR) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:47:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: Robertsky)
[22:50:04] <wikibugs>	 (PS9) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169)
[22:50:45] <logmsgbot>	 !log jsn@deploy2002 kharlan, jsn: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:50:49] <stashbot>	 T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866
[22:50:52] <wikibugs>	 (CR) CI reject: [V:-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery)
[22:51:15] <logmsgbot>	 !log jsn@deploy2002 kharlan, jsn: Continuing with sync
[22:51:41] <wikibugs>	 (PS10) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169)
[22:57:34] <EricGardner>	 JSherman: are you still waiting on this task to complete?
[22:59:20] <dancy>	 EricGardner: The last phase of the deployment is still in progress.  38% done
[23:00:36] <EricGardner>	 dancy: thanks! I will stay tuned I guess
[23:00:50] <JSherman>	 EricGardner: yep, didn't expect this one to take 2 hrs!
[23:01:51] <JSherman>	 I will absolutely have to drop after this completes
[23:02:24] <urbanecm>	 JSherman: yeah, last time i ran into this, i spent ~4 hrs in total (two attempts and a revert) :/. i hope that's not the case here.
[23:02:52] <urbanecm>	 but it has built, which is good
[23:03:15] <JSherman>	 Crossing my fingers here at ~90%
[23:04:26] <logmsgbot>	 !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] (duration: 50m 45s)
[23:04:30] <stashbot>	 T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866
[23:04:43] <JSherman>	 Finished!
[23:04:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875 (dr0ptp4kt) NEW
[23:05:20] <JSherman>	 EricGardner: I'm really sorry we got bumped
[23:06:31] <EricGardner>	 JSherman: No prob – are you sticking around to backport your patch now? I may be able to do both of ours if you have to go
[23:07:50] <JSherman>	 I have to drop, so that would be great
[23:08:38] <JSherman>	 Mine is just adding an extra data attribute for a future change, so it shouldn't have any impact
[23:17:54] <EricGardner>	 Ok. If no one here objects, I will proceed with deploying JSherman's patch (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218776) as well as my own (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1218825) since we got bumped out of our window earlier
[23:19:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: Jsn.sherman)
[23:19:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[23:23:31] <wikibugs>	 (Merged) jenkins-bot: [Moderator tools] Add data-mw-interface in addition to data-mw="interface" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: Jsn.sherman)
[23:27:39] <wikibugs>	 (Merged) jenkins-bot: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[23:28:18] <logmsgbot>	 !log egardner@deploy2002 Started scap sync-world: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]]
[23:28:23] <stashbot>	 T409187: The `data-mw` attribute should be reserved for Parsoid use; rename data-mw="interface" to data-mw-interface - https://phabricator.wikimedia.org/T409187
[23:28:24] <stashbot>	 T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857
[23:32:30] <logmsgbot>	 !log egardner@deploy2002 jsn, egardner: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:34:08] <logmsgbot>	 !log egardner@deploy2002 jsn, egardner: Continuing with sync
[23:40:04] <logmsgbot>	 !log egardner@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] (duration: 11m 47s)
[23:40:10] <stashbot>	 T409187: The `data-mw` attribute should be reserved for Parsoid use; rename data-mw="interface" to data-mw-interface - https://phabricator.wikimedia.org/T409187
[23:40:10] <stashbot>	 T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857
[23:40:48] <EricGardner>	 JSherman: your patch is deployed
[23:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:53:29] <wikibugs>	 (CR) RLazarus: [C:+2] kubernetes: Set default Envoy version to 1.35.7 [puppet] - https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)