[00:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:38:43] (03PS1) 10Pppery: Rename raw.json to en-x-raw.json [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217874 (https://phabricator.wikimedia.org/T412646) [00:39:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1217875 [00:39:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1217875 (owner: 10TrainBranchBot) [00:51:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1217875 (owner: 10TrainBranchBot) [00:52:03] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:09:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217877 [01:09:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217877 (owner: 10TrainBranchBot) [01:13:48] !log mwpresync@deploy2002 Finished scap build-images: 
Publishing wmf/next image (duration: 13m 09s) [01:34:26] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1217877 (owner: 10TrainBranchBot) [01:55:03] (03PS1) 10Pppery: Don't delete the @metadata key that translatewiki.net exports add [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217878 (https://phabricator.wikimedia.org/T412649) [01:56:26] (03PS2) 10Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217871 [01:56:54] (03PS2) 10Pppery: Rm "translations" into English [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217872 [01:58:09] (03PS2) 10Pppery: Don't delete the @metadata key that translatewiki.net exports add [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217878 (https://phabricator.wikimedia.org/T412649) [02:07:00] (03PS1) 10Pppery: Sort list of file references [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217879 (https://phabricator.wikimedia.org/T412649) [02:09:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86603 and previous config saved to /var/cache/conftool/dbconfig/20251215-020923-marostegui.json [02:09:29] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:09:30] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:09:44] (03PS1) 10Pppery: Suppress a spurious "File does not exist: " error at the end of string generation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217880 [02:20:56] (03PS3) 10Pppery: Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217873 
(https://phabricator.wikimedia.org/T412651) [02:22:00] (03PS3) 10Pppery: Rm "translations" into English [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217872 (https://phabricator.wikimedia.org/T412651) [02:24:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P86604 and previous config saved to /var/cache/conftool/dbconfig/20251215-022431-marostegui.json [02:25:14] (03PS2) 10Pppery: Suppress a spurious "File does not exist: " error at the end of string generation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217880 (https://phabricator.wikimedia.org/T412652) [02:25:49] (03PS3) 10Pppery: Rm frequency.json file [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217871 (https://phabricator.wikimedia.org/T412652) [02:26:11] (03PS4) 10Pppery: Rm "translations" into English [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217872 (https://phabricator.wikimedia.org/T412651) [02:30:02] (03PS1) 10Pppery: Rm .arcconfig [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217881 (https://phabricator.wikimedia.org/T412652) [02:36:46] FIRING: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:37:38] (03PS1) 10Pppery: Rename `--locale` arg to `--classlocale` [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217882 (https://phabricator.wikimedia.org/T412650) [02:39:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P86605 and previous config saved to /var/cache/conftool/dbconfig/20251215-023940-marostegui.json [02:54:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T411163 T411164)', diff 
saved to https://phabricator.wikimedia.org/P86606 and previous config saved to /var/cache/conftool/dbconfig/20251215-025448-marostegui.json [02:54:54] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:54:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:55:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:56:46] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [03:35:42] (03PS1) 10Pppery: Remove translations that were deleted from translatewiki.net [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217883 (https://phabricator.wikimedia.org/T412652) [03:41:32] (03PS4) 10Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [04:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:33:31] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [04:43:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [04:52:03] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:53:32] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [06:17:05] 06SRE, 06serviceops: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11458693 (10Marostegui) [06:17:10] 06SRE, 06serviceops: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11458694 (10Marostegui) p:05Triage→03Medium [06:23:08] (03PS1) 10Marostegui: instances.yaml: Remove sretest2003 [puppet] - 
10https://gerrit.wikimedia.org/r/1217891 (https://phabricator.wikimedia.org/T411173) [06:25:30] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove sretest2003 [puppet] - 10https://gerrit.wikimedia.org/r/1217891 (https://phabricator.wikimedia.org/T411173) (owner: 10Marostegui) [06:26:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove sretest2003 from dbctl T411173', diff saved to https://phabricator.wikimedia.org/P86607 and previous config saved to /var/cache/conftool/dbconfig/20251215-062645-marostegui.json [06:26:49] T411173: Remove sretest2003 (config H 1P) from production - https://phabricator.wikimedia.org/T411173 [06:28:59] (03PS1) 10Marostegui: site.pp: Remove sretest2003 from production [puppet] - 10https://gerrit.wikimedia.org/r/1217892 (https://phabricator.wikimedia.org/T411173) [06:31:10] (03CR) 10Marostegui: [C:03+2] site.pp: Remove sretest2003 from production [puppet] - 10https://gerrit.wikimedia.org/r/1217892 (https://phabricator.wikimedia.org/T411173) (owner: 10Marostegui) [06:33:30] !log Stop MariaDB on sretest2003 - T411173 [06:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:34] T411173: Remove sretest2003 (config H 1P) from production - https://phabricator.wikimedia.org/T411173 [06:38:17] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:44] (03PS1) 10Marostegui: sretest2003.yaml: Remove host [puppet] - 10https://gerrit.wikimedia.org/r/1218045 (https://phabricator.wikimedia.org/T411173) [06:54:13] (03CR) 10Marostegui: [C:03+2] sretest2003.yaml: Remove host [puppet] - 10https://gerrit.wikimedia.org/r/1218045 (https://phabricator.wikimedia.org/T411173) (owner: 10Marostegui) [06:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is 
about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:07:24] (03CR) 10Arnaudb: [C:03+2] collab: add vrts junk queue alert [alerts] - 10https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [07:09:02] (03Merged) 10jenkins-bot: collab: add vrts junk queue alert [alerts] - 10https://gerrit.wikimedia.org/r/1217548 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [07:16:26] (03CR) 10Muehlenhoff: [C:03+2] Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:22:19] (03PS1) 10Muehlenhoff: Remove access for joelyrookewmde [puppet] - 10https://gerrit.wikimedia.org/r/1218080 (https://phabricator.wikimedia.org/T412508) [07:23:03] (03CR) 10CI reject: [V:04-1] Remove access for joelyrookewmde [puppet] - 10https://gerrit.wikimedia.org/r/1218080 (https://phabricator.wikimedia.org/T412508) (owner: 10Muehlenhoff) [07:31:35] (03PS2) 10Muehlenhoff: Remove access for joelyrookewmde [puppet] - 10https://gerrit.wikimedia.org/r/1218080 (https://phabricator.wikimedia.org/T412508) [07:52:20] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Suppress a spurious "File does not exist: " error at the end of string generation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217880 (https://phabricator.wikimedia.org/T412652) (owner: 10Pppery) [08:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T0800). [08:00:05] sfaci and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:08] o/ [08:00:20] \o [08:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:13:17] 06SRE, 06serviceops: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11458891 (10elukey) [08:33:08] 06SRE, 06serviceops: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11458924 (10elukey) High level things to do afaics: 1) Our current strategy so far has been to add docker credentials to the root's home directory, essentially to limit the users that can push images. The buil... [08:40:23] (03Abandoned) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [08:43:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [08:44:58] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1218205 [08:48:57] (03CR) 10Marostegui: [C:03+1] clone.py: Upsert instance data in Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1214083 (https://phabricator.wikimedia.org/T410084) (owner: 10Federico Ceratto) [08:52:03] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:53:22] (03CR) 10Marostegui: sre.mysql.pool: [de]pool various section kinds (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 
(https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [08:54:48] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1218205 (owner: 10Elukey) [08:57:03] (03PS1) 10Elukey: Upstream release v12.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1218206 [08:57:21] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1218206 (owner: 10Elukey) [09:03:38] (03PS4) 10Federico Ceratto: sre.mysql.pool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [09:08:25] (03CR) 10Federico Ceratto: "It was due to linters erroring on docstrings and whitespaces, now fixed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:08:32] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:13:26] (03CR) 10Marostegui: "Can you please make a note on the patch as I mentioned above?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:30:12] (03CR) 10Effie Mouzeli: [C:03+1] wikikube-staging: Bump calico memory requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 (owner: 10Clément Goubert) [09:31:14] jouncebot nowandnext [09:31:15] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [09:31:15] In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1100) [09:33:07] (03PS1) 10Ayounsi: [WIP] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) [09:33:11] Any issue with me doing two low risk backports? The UTC morning backport didn't appear to run [09:36:16] I'm not qualified to give an answer to that, but can you ping me when you're done? (I'd like to plop a config patch at some point today) [09:36:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance [09:38:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [09:38:30] (03PS1) 10Ayounsi: Netbox scripts: remove the scheduling UI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218210 [09:38:46] (i'll actually put that in the window this afternoon, that'll be easier) [09:44:52] (03CR) 10Majavah: [C:03+2] network: Remove unused cloud_nova_hosts_ranges variable [puppet] - 10https://gerrit.wikimedia.org/r/1214099 (owner: 10Majavah) [09:44:53] (03PS1) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 
10https://gerrit.wikimedia.org/r/1218211 [09:46:55] slyngs, akosiaris: There are two patches scheduled for deployment during the UTC morning backport window that didn't get deployed. AIUI they need to be deployed before the next Test Kitchen experiment deployment window. The patches are low risk. Is there any issue with me deploying them? [09:46:58] !log uploaded spicerack_12.2.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] phuedx: not a SRE, but i don't think there's an issue [09:47:41] phuedx: I don't think so [09:48:15] Thanks both <3 [09:48:38] You break it you buy it :-) [09:50:12] I've got a couple of coins and a Danish pastry [09:50:14] (03PS2) 10Ayounsi: Netbox scripts: remove the scheduling UI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218210 [09:51:05] heads up we (cc @ihurbain) are going to deploy kartotherian on eqiad/codfw after testing on staging looks oK [09:51:15] o/ [09:52:06] (03CR) 10Effie Mouzeli: "To facilitate reviewing, I would suggest to:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 (owner: 10Jasmine) [09:52:31] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [09:52:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [09:52:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217791 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [09:53:04] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [09:53:43] (03Merged) 10jenkins-bot: product_metrics.contributors.experiments stream needs use_edge_uniques [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217783 (https://phabricator.wikimedia.org/T405177) (owner: 10DLynch) [09:54:10] (03Merged) 10jenkins-bot: mobileSectionSwitch: experiment name change [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1217791 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [09:54:46] (03CR) 10Btullis: [C:03+2] Revert webrequest related datasets retention [puppet] - 10https://gerrit.wikimedia.org/r/1217563 (https://phabricator.wikimedia.org/T412321) (owner: 10Joal) [09:54:54] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1217783|product_metrics.contributors.experiments stream needs use_edge_uniques (T405177 T410803)]], [[gerrit:1217791|mobileSectionSwitch: experiment name change (T410803)]] [09:54:59] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [09:55:00] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [09:58:01] !log ihurbain@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [09:58:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - 
https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [10:03:25] (03CR) 10Effie Mouzeli: "It would be great if we could attach this to a task too 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 (owner: 10Jasmine) [10:08:59] !log ihurbain@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [10:13:36] (03PS1) 10Blake: Revert^2 "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1218213 [10:16:20] (03PS2) 10Blake: Revert^2 "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1218213 [10:16:39] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [10:17:35] (03PS2) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [10:17:49] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [10:19:35] (03CR) 10Elukey: [C:03+1] Revert^2 "service: add exclude_from_switchover field." [puppet] - 10https://gerrit.wikimedia.org/r/1218213 (owner: 10Blake) [10:19:55] (03CR) 10Blake: [C:03+2] Revert^2 "service: add exclude_from_switchover field." 
[puppet] - 10https://gerrit.wikimedia.org/r/1218213 (owner: 10Blake) [10:20:02] !log spicerack upgraded to 12.2.0 on cumin nodes [10:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:07] (03PS2) 10Tchanders: Add experimental temporary account creation rate limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) [10:22:20] (03CR) 10Tchanders: Add experimental temporary account creation rate limits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) (owner: 10Tchanders) [10:27:21] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [10:27:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) (owner: 10Tchanders) [10:29:10] !log phuedx@deploy2002 kemayo, phuedx: Backport for [[gerrit:1217783|product_metrics.contributors.experiments stream needs use_edge_uniques (T405177 T410803)]], [[gerrit:1217791|mobileSectionSwitch: experiment name change (T410803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:29:15] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [10:29:15] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [10:29:54] Taking a look now /cc sfaci [10:30:52] thanks phuedx! [10:31:25] The config change for the stream is coming through [10:31:29] Checking the experiment [10:31:44] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [10:32:50] Experiment change is coming through as well. 
No errors in the browser console [10:33:01] !log phuedx@deploy2002 kemayo, phuedx: Continuing with sync [10:33:06] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [10:36:11] (03PS1) 10Btullis: Update the network policies that are deployed by the spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218219 (https://phabricator.wikimedia.org/T406833) [10:36:25] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [10:38:06] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [10:38:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:42:41] (03CR) 10Federico Ceratto: [C:03+1] "I updated the commit message and added a comment in the main tracking task" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:45:54] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217783|product_metrics.contributors.experiments stream needs use_edge_uniques (T405177 T410803)]], [[gerrit:1217791|mobileSectionSwitch: experiment name change (T410803)]] (duration: 50m 59s) [10:46:00] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [10:46:00] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [10:46:13] sfaci: Those changes are deployed. I'll be watching the logs [10:46:24] (03PS1) 10Kosta Harlan: Add acct_creation_throttle_hit equivalent for temp. 
accounts [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218220 (https://phabricator.wikimedia.org/T412105) [10:46:47] (03PS1) 10Kosta Harlan: Add 'acct_creation_throttle_hit-temp' [extensions/WikimediaMessages] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218221 (https://phabricator.wikimedia.org/T412105) [10:47:14] (03CR) 10Btullis: [C:03+2] Update the network policies that are deployed by the spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218219 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [10:47:43] (03PS2) 10Kosta Harlan: Add 'acct_creation_throttle_hit-temp' [extensions/WikimediaMessages] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218221 (https://phabricator.wikimedia.org/T412105) [10:47:59] (03PS3) 10Kosta Harlan: Add 'acct_creation_throttle_hit-temp' [extensions/WikimediaMessages] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218221 (https://phabricator.wikimedia.org/T412105) [10:48:52] (03CR) 10Marostegui: "Probably also add the task number on the patch if that's ok." [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:49:40] (03CR) 10Marostegui: "The idea is to make sure we can track this backwards by looking at the code when the time comes." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:55:00] (03PS7) 10Federico Ceratto: mysqld_exporter.pp: make /var/log/prometheus 0775 [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) [10:55:05] (03Merged) 10jenkins-bot: Update the network policies that are deployed by the spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218219 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [10:55:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:20] (03CR) 10Marostegui: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:56:13] (03CR) 10Federico Ceratto: "It was in the commit message but I also added a comment with task number near the permission setting in the script." [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:56:51] (03CR) 10Federico Ceratto: [C:03+2] mysqld_exporter.pp: make /var/log/prometheus 0775 [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:56:54] (03CR) 10Marostegui: [C:03+1] "Yeah, but it is easier if it is referenced in the code." [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:00:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1100) [11:00:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:03:39] !log Deploy schema change on x1 with replication https://phabricator.wikimedia.org/T412687 [11:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:33] phuedx Thank you very much! I have taken a look at the changes and everything looks good [11:06:33] jouncebot: nowandnext [11:06:33] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1100) [11:06:33] In 0 hour(s) and 53 minute(s): Verification email release (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1200) [11:06:55] can I sync some wmf.5 and config changes, or is someone deploying? [11:07:41] (03PS1) 10Kosta Harlan: Fix expected message on temp. account create limit [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) [11:09:44] (03PS1) 10Btullis: Add more entries to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218225 (https://phabricator.wikimedia.org/T406833) [11:10:30] marostegui: do you mind if I proceed? 
[11:10:37] kostajh: sure thing [11:10:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) (owner: 10Kosta Harlan) [11:10:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218220 (https://phabricator.wikimedia.org/T412105) (owner: 10Kosta Harlan) [11:10:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218221 (https://phabricator.wikimedia.org/T412105) (owner: 10Kosta Harlan) [11:10:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) (owner: 10Tchanders) [11:11:48] (03Merged) 10jenkins-bot: Add experimental temporary account creation rate limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217744 (https://phabricator.wikimedia.org/T412222) (owner: 10Tchanders) [11:12:49] (03PS1) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) [11:22:47] (03Merged) 10jenkins-bot: Add acct_creation_throttle_hit equivalent for temp. 
accounts [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218220 (https://phabricator.wikimedia.org/T412105) (owner: 10Kosta Harlan) [11:22:50] (03Merged) 10jenkins-bot: Add 'acct_creation_throttle_hit-temp' [extensions/WikimediaMessages] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218221 (https://phabricator.wikimedia.org/T412105) (owner: 10Kosta Harlan) [11:24:55] (03CR) 10Btullis: [C:03+2] Add more entries to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218225 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [11:27:09] (03Merged) 10jenkins-bot: Add more entries to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218225 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [11:30:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [11:30:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [11:31:01] (03CR) 10CI reject: [V:04-1] Fix expected message on temp. account create limit [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) (owner: 10Kosta Harlan) [11:34:12] (03CR) 10Kosta Harlan: [C:03+2] Fix expected message on temp. account create limit [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) (owner: 10Kosta Harlan) [11:36:01] (03CR) 10Kosta Harlan: [C:03+2] "recheck" [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) (owner: 10Kosta Harlan) [11:36:08] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1218220|Add acct_creation_throttle_hit equivalent for temp. 
accounts (T412105)]], [[gerrit:1218221|Add 'acct_creation_throttle_hit-temp' (T412105)]], [[gerrit:1217744|Add experimental temporary account creation rate limits (T412222)]] [11:36:14] T412105: Nudge temporary users who have hit the rate limit to create an account - https://phabricator.wikimedia.org/T412105 [11:36:15] T412222: Update temporary account creation rate limits - https://phabricator.wikimedia.org/T412222 [11:36:53] (03PS1) 10Aqu: Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) [11:38:19] (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [11:39:19] (03CR) 10CI reject: [V:04-1] Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) (owner: 10Aqu) [11:40:45] kostajh: note i have a window in ~20, hope that's fine :) [11:42:38] urbanecm: yes, I messaged you in Slack a while back -- I hope I should be done by then, although one of the patches failed which is delaying things a bit [11:42:59] ah, i didn't see that notif. [11:43:01] urbanecm: are you deploying a config patch then? [11:43:18] no, an i18n backport and a config patch (hence the window) [11:44:32] mainly because i got hit with T412265 before. 
but if your i18n build works, then _maybe_ so will mine [11:44:32] T412265: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265 [11:44:39] urbanecm: ack [11:45:14] urbanecm: note that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1218224 will ride along with your changes, as that is the patch that temporarily failed (as it depended on other patches that hadn't finished merging) [11:45:31] (03PS1) 10Muehlenhoff: Record LDAP access for ggalofre [puppet] - 10https://gerrit.wikimedia.org/r/1218234 [11:45:38] gotcha. no idea how to test that one though. [11:46:02] urbanecm: you don't need to, the test is if CI passes :) [11:46:24] ah, i thought it changes sth in prod too, [11:52:09] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for ggalofre [puppet] - 10https://gerrit.wikimedia.org/r/1218234 (owner: 10Muehlenhoff) [11:56:01] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments::fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table # T412040-fix-dryrun [11:56:05] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [11:56:10] !log Deploy schema change on x1 with replication https://phabricator.wikimedia.org/T410241 [11:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:21] ahh, only one colon...
[11:57:34] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table # T412040-fix-dryrun [11:57:50] (03PS1) 10Marostegui: filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/1218238 (https://phabricator.wikimedia.org/T410241) [11:59:19] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Add new column [puppet] - 10https://gerrit.wikimedia.org/r/1218238 (https://phabricator.wikimedia.org/T410241) (owner: 10Marostegui) [11:59:23] (03Merged) 10jenkins-bot: Fix expected message on temp. account create limit [extensions/Wikibase] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218224 (https://phabricator.wikimedia.org/T412219) (owner: 10Kosta Harlan) [12:00:05] urbanecm: Your horoscope predicts another Verification email release deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1200). [12:00:15] my horoscope says to wait for kostajh's go ahead [12:00:39] hehe [12:01:03] you could probably +2 your changes now to save some time, but might be safer to wait a little while longer [12:01:11] These horoscopes are unusually specific [12:01:42] kostajh: i'd defer to you. depends on where the scap is at. [12:01:51] building the image still [12:02:18] I don't anticipate needing to revert the patches. But I guess even if I did, the revert could be deployed along with your verification email changes [12:03:02] kostajh: the problem is prior deployment of my changes has errored in the middle of image build [12:03:06] so i'd prefer if they rode alone [12:03:33] ok, I'll let you know as soon as I'm done. Probably another 20 minutes, I'd guess, for image building to finish [12:04:17] good news is that if it fails for _you_, then i'm 99% certain the issue i was running into is specific to doing an i18n build. 
[12:06:23] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --force --search-index --db-table # T412040-fix [12:06:27] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [12:09:12] (03CR) 10A-pizzata: Pyrra: add the MWH completeness SLO under Data Platform (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [12:10:14] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:13:22] (03PS1) 10Muehlenhoff: aptrepo: Remove HPE configs [puppet] - 10https://gerrit.wikimedia.org/r/1218244 [12:13:52] (03CR) 10CI reject: [V:04-1] aptrepo: Remove HPE configs [puppet] - 10https://gerrit.wikimedia.org/r/1218244 (owner: 10Muehlenhoff) [12:18:21] (03PS2) 10Muehlenhoff: aptrepo: Remove HPE configs [puppet] - 10https://gerrit.wikimedia.org/r/1218244 [12:18:44] (03PS1) 10Effie Mouzeli: mcrouter: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218245 (https://phabricator.wikimedia.org/T412693) [12:21:59] still waiting on the images to build [12:24:22] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218245 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [12:25:25] jouncebot: nowandnext [12:25:25] For the next 1 hour(s) and 34 minute(s): Verification email release (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1200) [12:25:25] In 1 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1400) [12:26:04] (03Merged) 10jenkins-bot: mcrouter: update chart metadata 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1218245 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [12:27:16] ihurbain: I'm deploying now [12:27:20] and urbanecm is next [12:27:49] * urbanecm might not have enough time left though :/ [12:27:54] ack :) (i'm in no hurry, it was mostly a "eh, if it's free i'll attempt something" kind of poke) [12:28:06] we'll see [12:28:34] "Waiting 300 seconds for swift" seems promising though! :) [12:30:32] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --force --search-index --db-table # T412040-fix-02 [12:30:36] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [12:31:31] !log kharlan@deploy2002 kharlan, tchanders: Backport for [[gerrit:1218220|Add acct_creation_throttle_hit equivalent for temp. accounts (T412105)]], [[gerrit:1218221|Add 'acct_creation_throttle_hit-temp' (T412105)]], [[gerrit:1217744|Add experimental temporary account creation rate limits (T412222)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:31:36] T412105: Nudge temporary users who have hit the rate limit to create an account - https://phabricator.wikimedia.org/T412105 [12:31:37] T412222: Update temporary account creation rate limits - https://phabricator.wikimedia.org/T412222 [12:35:50] !log kharlan@deploy2002 kharlan, tchanders: Continuing with sync [12:44:28] (03PS1) 10Effie Mouzeli: cxserver: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218270 (https://phabricator.wikimedia.org/T412693) [12:48:31] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218220|Add acct_creation_throttle_hit equivalent for temp. 
accounts (T412105)]], [[gerrit:1218221|Add 'acct_creation_throttle_hit-temp' (T412105)]], [[gerrit:1217744|Add experimental temporary account creation rate limits (T412222)]] (duration: 72m 23s) [12:48:37] T412105: Nudge temporary users who have hit the rate limit to create an account - https://phabricator.wikimedia.org/T412105 [12:48:37] T412222: Update temporary account creation rate limits - https://phabricator.wikimedia.org/T412222 [12:49:18] urbanecm: ok, done now [12:49:53] perf [12:49:56] (03PS3) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [12:51:56] (03PS2) 10Effie Mouzeli: cxserver: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218270 (https://phabricator.wikimedia.org/T412693) [12:52:03] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:52:36] !log imported jenkins 2.528.3 to thirdparty/ci for bullseye-wikimedia T412694 [12:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:53] !log imported jenkins 2.528.3 to thirdparty/jenkins for bookworm-wikimedia T412694 [12:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:56] (03CR) 10Dpogorzelski: "prune_prod_images" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [12:56:26] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [12:58:01] (03PS1) 10Btullis: Add a dbt profiles.yml file to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218271 (https://phabricator.wikimedia.org/T410017) [13:01:05] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218273 [13:01:47] (03CR) 10Elukey: 
[C:03+1] aptrepo: Remove HPE configs [puppet] - 10https://gerrit.wikimedia.org/r/1218244 (owner: 10Muehlenhoff) [13:01:51] (03PS1) 10Urbanecm: Revert^4 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218274 (https://phabricator.wikimedia.org/T411526) [13:01:59] (03CR) 10Urbanecm: [C:03+2] Revert^4 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218274 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:02:22] (03PS1) 10Urbanecm: Revert^4 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218275 (https://phabricator.wikimedia.org/T411526) [13:02:31] (03CR) 10Urbanecm: [C:03+2] Revert^4 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218275 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:03:07] (03CR) 10Elukey: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218273 (owner: 10Muehlenhoff) [13:04:01] (03CR) 10Btullis: [C:03+2] Add a dbt profiles.yml file to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218271 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [13:04:04] (03CR) 10Muehlenhoff: [C:03+2] aptrepo: Remove HPE configs [puppet] - 10https://gerrit.wikimedia.org/r/1218244 (owner: 10Muehlenhoff) [13:05:48] (03Merged) 10jenkins-bot: Add a dbt profiles.yml file to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218271 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [13:06:58] (03PS1) 10Ladsgroup: Set wgExternalLinksIgnoreDomains in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218276 (https://phabricator.wikimedia.org/T405005) [13:07:59] jouncebot: nowandnext [13:07:59] For the next 
0 hour(s) and 52 minute(s): Verification email release (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1200) [13:07:59] In 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1400) [13:10:52] Amir1: still deploying :) [13:11:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [13:11:03] let me know when you're done <3 [13:11:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [13:11:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218274 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:11:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218275 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:11:45] Amir1: will do. although that might be when the window starts. 
[13:15:50] no worries [13:16:01] (03Merged) 10jenkins-bot: Revert^4 "Confirmation email: further styling adjustments" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218274 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:16:41] (03Merged) 10jenkins-bot: Revert^4 "i18n: replace <> to avoid false positive export errors" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218275 (https://phabricator.wikimedia.org/T411526) (owner: 10Urbanecm) [13:18:26] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1218274|Revert^4 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1218275|Revert^4 "i18n: replace <> to avoid false positive export errors" (T411526)]] [13:18:30] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [13:26:57] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218273 (owner: 10Muehlenhoff) [13:27:35] (03PS1) 10Majavah: P:toolforge::prometheus: Update kube-state-metrics selector [puppet] - 10https://gerrit.wikimedia.org/r/1218278 (https://phabricator.wikimedia.org/T412506) [13:28:43] !log dropped databases of deleted wikis in x1 (T411835) [13:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:47] T411835: Investigate unusal dbs in x1 - https://phabricator.wikimedia.org/T411835 [13:29:49] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Update kube-state-metrics selector [puppet] - 10https://gerrit.wikimedia.org/r/1218278 (https://phabricator.wikimedia.org/T412506) (owner: 10Majavah) [13:34:07] (03CR) 10Aklapper: [C:03+2] Rm .arcconfig [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217881 (https://phabricator.wikimedia.org/T412652) (owner: 10Pppery) [13:34:12] (03CR) 10Aklapper: [V:03+2 C:03+2] Rm .arcconfig [phabricator/translations] (wmf/stable) - 
10https://gerrit.wikimedia.org/r/1217881 (https://phabricator.wikimedia.org/T412652) (owner: 10Pppery) [13:36:29] (03Abandoned) 10Ahonc: Change votewiki language to Ukrainian. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217760 (https://phabricator.wikimedia.org/T412521) (owner: 10Ahonc) [13:41:41] !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@7a991f5] (releasing): Redeploy to upgrade releases Jenkins T412694 [13:42:53] !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@7a991f5] (releasing): Redeploy to upgrade releases Jenkins T412694 (duration: 02m 05s) [13:43:25] (03CR) 10JMeybohm: "Should we add a separate probe that only checks staging then?" [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey) [13:45:13] (03CR) 10Elukey: "Sure if needed yes, I thought it was something that we didn't really care/want-to-alarm-on for this use case. In any case, I'd proceed wit" [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey) [13:45:38] (03CR) 10Aklapper: [V:03+2 C:03+2] Suppress a spurious "File does not exist: " error at the end of string generation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217880 (https://phabricator.wikimedia.org/T412652) (owner: 10Pppery) [13:46:36] (03CR) 10Aklapper: [V:03+2 C:03+2] Sort list of file references [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217879 (https://phabricator.wikimedia.org/T412649) (owner: 10Pppery) [13:46:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:54] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1218274|Revert^4 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1218275|Revert^4 "i18n: replace <> to avoid false positive export errors" (T411526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:51:58] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [13:54:07] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:58:59] (03CR) 10Urbanecm: Logos: Destandardize thumbnail sizes, handle missing responsive URLs (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1400). [14:00:05] anzx and ihurbain: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] o/ [14:00:20] hoping to have the scap finished soon [14:00:29] ack [14:00:38] anzx: i posted some questions on your patch, do you know the answers? [14:00:47] checking [14:00:54] ihurbain: i assume you plan to self-deploy? [14:01:10] i have no plan but yes i can do that :D [14:01:23] happy either way, just asking :D [14:01:37] i'll ping you when done [14:01:45] i'll do it, spiderpig muscle stretching is a good thing :D [14:01:53] and it makes it much easier, right? 
[14:01:57] yes [14:02:11] * urbanecm vividly remembers thinking about sync orders [14:02:19] well it makes one thing hard. i have an awesome spiderpig sticker and no space on my laptop. it's a problem. [14:02:28] where'd you get that? [14:02:35] offsite in lisbon last week [14:02:39] ahh, makes sense [14:02:54] we should have sent one to you via otto :D [14:03:18] would've worked :). well, there's always next time! [14:03:20] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [14:03:24] (03PS1) 10Muehlenhoff: Revert "svg: refuse to generate SVGs larger than a particular size" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) [14:03:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11460277 (10ops-monitoring-bot) Host gitlab2002.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new subnets [14:03:39] (my scap is at 50%) [14:05:10] (03CR) 10JMeybohm: "I think adding staging services to service-catalog wasn't really a use case until now. 
If it is, we should IMHO apply the same rules as fo" [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey) [14:05:21] (03PS5) 10Urbanecm: enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) [14:05:24] (03PS3) 10Urbanecm: Enable HTML confirmation email on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) [14:05:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [14:05:34] (03PS1) 10Volans: [draft] service: allow to ignore dataclass extra arguments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1218285 (https://phabricator.wikimedia.org/T412700) [14:05:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) (owner: 10Urbanecm) [14:06:18] !log installing libpng1.6 security updates [14:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:22] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218274|Revert^4 "Confirmation email: further styling adjustments" (T411526)]], [[gerrit:1218275|Revert^4 "i18n: replace <> to avoid false positive export errors" (T411526)]] (duration: 47m 56s) [14:06:26] T411526: Improve CSS styling for verification email - https://phabricator.wikimedia.org/T411526 [14:06:28] and here we are! [14:06:40] ihurbain: over to you. please ping when done, i'll take a look at anzx's patches. [14:06:47] ok! 
[14:06:53] spiderpig here i come! [14:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [14:08:06] (03Merged) 10jenkins-bot: Activate post-processing cache on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217768 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [14:08:25] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]] [14:08:31] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [14:09:50] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [14:10:03] FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:01] (03CR) 10CI reject: [V:04-1] Revert "svg: refuse to generate SVGs larger than a particular size" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [14:12:30] !log ihurbain@deploy2002 ihurbain: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:13:34] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218290 [14:14:00] urbanecm: since script earlier assumes standard thumbnail sizes, I think should have used `get("2", thumburl)` same as 1.5px, i think ppery used it because only 1.5px images were failed to generate [14:14:06] !log ihurbain@deploy2002 ihurbain: Continuing with sync [14:14:48] * urbanecm is still not sure he understands the logic behind the patch [14:15:03] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:20] (03CR) 10Volans: "This is just to show you a possible approach." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1218285 (https://phabricator.wikimedia.org/T412700) (owner: 10Volans) [14:20:32] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]] (duration: 12m 07s) [14:20:37] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [14:21:02] urbanecm: my scap is done [14:21:05] perf [14:21:45] anzx: i'm still not sure i understand the explanations in the patch. the problem makes sense to me on the task, but the code-level comments in the patch not so much [14:22:00] let's do the logo change and leave this for later, when the patch author has a chance to respond? [14:22:46] (03CR) 10Urbanecm: [C:04-1] "-1, as i don't understand what the comments are saying on multiple reads.
the problem is clear to me from the task, but the comments feel " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [14:22:56] (03PS4) 10Anzx: niawiktionary: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850) [14:23:45] (03PS10) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:25:24] urbanecm: ok, i think once we generate logo , it shouldn't rely on manage.py to be merged before [14:25:29] indeed [14:25:34] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218293 [14:25:35] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218294 [14:25:39] (03CR) 10Ladsgroup: "Note that we are planning to rate limit non-standard sizes. You might not be able to hit those endpoints at all." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [14:25:47] sorry about that, but i'd prefer not merge something i don't understand [14:26:12] ok that's reasonable [14:26:21] (03CR) 10Urbanecm: [C:03+2] niawiktionary: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850) (owner: 10Anzx) [14:26:24] CI is green, so let's proceed [14:26:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850) (owner: 10Anzx) [14:27:10] (03Merged) 10jenkins-bot: niawiktionary: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217522 (https://phabricator.wikimedia.org/T411850) (owner: 10Anzx) [14:27:30] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1217522|niawiktionary: update logo (T411850)]] [14:27:34] T411850: nia.wiktionary - Rename namespace Wiktionary into Wikikamus - https://phabricator.wikimedia.org/T411850 [14:29:32] !log urbanecm@deploy2002 urbanecm, anzx: Backport for [[gerrit:1217522|niawiktionary: update logo (T411850)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:29:38] anzx: can you test? [14:30:05] urbanecm: Is that okay if I +2 a couple of backports since it'll take a long time to merge [14:30:29] (03PS1) 10Ladsgroup: SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) [14:30:31] Amir1: i have two remaining config patches, maybe once they're started? 
[14:30:40] urbanecm: logo change appears correctly, ok to sync [14:30:46] !log urbanecm@deploy2002 urbanecm, anzx: Continuing with sync [14:30:52] (03CR) 10Elukey: "okok that is fine to me, but IIUC we'd need another probe and I thought a new one could be added afterwards, to avoid the current false po" [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey) [14:30:55] (03CR) 10Urbanecm: [C:03+2] enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [14:30:56] (03CR) 10Urbanecm: [C:03+2] Enable HTML confirmation email on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) (owner: 10Urbanecm) [14:30:56] sure! [14:31:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1184044 (owner: 10Aklapper) [14:31:52] (03Merged) 10jenkins-bot: enwiki: Enable HTML confirmation email [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [14:31:56] (03Merged) 10jenkins-bot: Enable HTML confirmation email on Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211813 (https://phabricator.wikimedia.org/T410971) (owner: 10Urbanecm) [14:32:30] (03PS1) 10Ladsgroup: ParserOutput: Allow for ignoring a set of domains for externallinks [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218296 (https://phabricator.wikimedia.org/T405005) [14:34:57] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217522|niawiktionary: update logo (T411850)]] (duration: 07m 27s) [14:35:01] T411850: nia.wiktionary - Rename namespace Wiktionary into Wikikamus - https://phabricator.wikimedia.org/T411850 [14:35:04] here we go [14:35:05] anzx: done [14:35:28] Amir1: feel free to +2 yours now [14:35:33] cooool [14:35:37] !log urbanecm@deploy2002 
Started scap sync-world: Backport for [[gerrit:1211810|enwiki: Enable HTML confirmation email (T410970)]], [[gerrit:1211813|Enable HTML confirmation email on Wikidata and Commons (T410971)]] [14:35:40] (03CR) 10Ladsgroup: [C:03+2] ParserOutput: Allow for ignoring a set of domains for externallinks [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218296 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [14:35:42] T410970: Improve verification email: release to English Wikipedia - https://phabricator.wikimedia.org/T410970 [14:35:42] T410971: Improve verification email: release to Commons and Wikidata - https://phabricator.wikimedia.org/T410971 [14:36:01] (03PS3) 10Jbond: interface::alias: update define to get prefix len from netmask [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) [14:36:48] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:37:40] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1211810|enwiki: Enable HTML confirmation email (T410970)]], [[gerrit:1211813|Enable HTML confirmation email on Wikidata and Commons (T410971)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:37:40] urbanecm: thanks for deploying, logo change looks correct with turning off debug [14:38:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:36] perfect [14:38:41] (03CR) 10Pppery: "This is not a hot script. It's run once manually every time a wiki logo is changed (and accesses 3 nonstandard sizes of 1 file). 
Unless th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [14:40:48] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7820/console" [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: 10Jbond) [14:41:41] lovely. this means it works! :) https://usercontent.irccloud-cdn.com/file/7tPJRtfQ/image.png [14:41:43] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:42:49] 10ops-codfw, 06SRE, 06DC-Ops: Outbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T407556#11460497 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:43:29] (03CR) 10CI reject: [V:04-1] SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [14:44:44] (03CR) 10Ayounsi: "Interested in your feedback before I spend more time on it." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:45:50] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211810|enwiki: Enable HTML confirmation email (T410970)]], [[gerrit:1211813|Enable HTML confirmation email on Wikidata and Commons (T410971)]] (duration: 10m 13s) [14:45:56] T410970: Improve verification email: release to English Wikipedia - https://phabricator.wikimedia.org/T410970 [14:45:56] T410971: Improve verification email: release to Commons and Wikidata - https://phabricator.wikimedia.org/T410971 [14:47:10] and that should be it [14:47:18] Amir1: over to you (once the CI gets merciful on you) [14:47:59] (03CR) 10JMeybohm: [C:03+1] "Sure! 
Your credibility is high enough to merge this rn and follow up after 😄" [alerts] - 10https://gerrit.wikimedia.org/r/1217107 (owner: 10Elukey) [14:48:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11460526 (10Jclark-ctr) [14:48:33] (03Merged) 10jenkins-bot: ParserOutput: Allow for ignoring a set of domains for externallinks [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218296 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [14:54:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P86610 and previous config saved to /var/cache/conftool/dbconfig/20251215-145408-ladsgroup.json [14:54:13] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:55] (03PS4) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [14:57:53] urbanecm: Thank you! 
[14:58:06] (03PS5) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [14:58:17] RESOLVED: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:22] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]] [14:58:26] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005 [14:58:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:58:30] (03PS6) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [14:58:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86611 and previous config saved to /var/cache/conftool/dbconfig/20251215-145836-marostegui.json [14:58:44] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:58:45] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:59:10] (03PS7) 10Dpogorzelski: ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 [14:59:19] (03PS1) 10Hamish: zhwiki: enable protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) [14:59:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool db2148', diff saved to https://phabricator.wikimedia.org/P86612 and previous config saved to /var/cache/conftool/dbconfig/20251215-145943-marostegui.json [15:00:20] !log marostegui@cumin1003 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [15:00:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86613 and previous config saved to /var/cache/conftool/dbconfig/20251215-150027-marostegui.json [15:00:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:01:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:01:10] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86614 and previous config saved to /var/cache/conftool/dbconfig/20251215-150111-marostegui.json [15:02:43] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:03:27] (03PS1) 10Hamish: svwiki: lift autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) [15:03:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish) [15:04:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish) [15:06:48] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218296|ParserOutput: Allow for ignoring a set of domains for externallinks (T405005)]] (duration: 08m 26s) [15:06:52] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005 [15:09:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86615 and previous config saved to /var/cache/conftool/dbconfig/20251215-150916-ladsgroup.json [15:09:40] 06SRE, 06Infrastructure-Foundations: nodesource node20 apt mirror is broken - https://phabricator.wikimedia.org/T412342#11460633 (10elukey) 05Open→03Resolved Fixed last week :) [15:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:14:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11460664 (10Jclark-ctr) p:05Triage→03Medium a:03Jclark-ctr [15:15:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11460667 (10Jclark-ctr) @cmooney i have moved all Scs cables from nokia switches to Juniper switches in rows c/d [15:15:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11460668 (10Jclark-ctr) a:05Jclark-ctr→03cmooney [15:16:41] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 74 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: 10Jbond) [15:20:03] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. 
- https://phabricator.wikimedia.org/T412458#11460682 (10elukey) p:05Triage→03Medium [15:20:20] (03CR) 10Cathal Mooney: [C:03+1] "Seems sane yep, it potentially could be abused so +1" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218210 (owner: 10Ayounsi) [15:21:14] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11460683 (10elukey) @Jclark-ctr thanks a lot! I am going to review it tomorrow, let me know if there is a server that I can test it w... [15:22:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Juniper EX/QFX switches in eqiad rows C/D - https://phabricator.wikimedia.org/T412271#11460722 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Alll 4 links have been removed physically and deleted from netbox. Scs c... [15:24:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86616 and previous config saved to /var/cache/conftool/dbconfig/20251215-152425-ladsgroup.json [15:24:46] (03PS2) 10Aqu: Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) [15:25:56] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11460752 (10Jclark-ctr) @elukey no problem. I didn’t make any edits to the provisioning script it was just my observation. After revi... [15:26:26] (03CR) 10FNegri: [C:03+2] P:toolforge:prometheus: scrape mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) (owner: 10FNegri) [15:27:23] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. 
- https://phabricator.wikimedia.org/T412458#11460761 (10Jclark-ctr) T408760 Is the ticket that is pending to be racked at eqiad [15:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1530) [15:33:45] anything happening? If not, I'm going to deploy some stuff [15:33:51] (03CR) 10Ladsgroup: [C:03+2] Set wgExternalLinksIgnoreDomains in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218276 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [15:34:25] (03PS1) 10Ladsgroup: SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) [15:34:47] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7822/consol" [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:34:53] (03Merged) 10jenkins-bot: Set wgExternalLinksIgnoreDomains in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218276 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [15:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:07] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218276 (https://phabricator.wikimedia.org/T405005) (owner: 10Ladsgroup) [15:36:23] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]] [15:36:28] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - 
https://phabricator.wikimedia.org/T405005 [15:38:22] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:38:36] 10ops-codfw, 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078#11460813 (10LSobanski) [15:39:01] !log Upgrading CI Jenkins [15:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:08] (03CR) 10Jelto: [V:03+1 C:03+1] "pcc against a subset of `'R(interface::alias)'` hosts looks reasonable to me. This change would help unblock gitlab subnet mismatch T37001" [puppet] - 10https://gerrit.wikimedia.org/r/618766 (owner: 10Volans) [15:39:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T410589)', diff saved to https://phabricator.wikimedia.org/P86617 and previous config saved to /var/cache/conftool/dbconfig/20251215-153933-ladsgroup.json [15:39:37] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:39:49] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [15:39:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86618 and previous config saved to /var/cache/conftool/dbconfig/20251215-153957-ladsgroup.json [15:40:05] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Offboarding for joelyrookewmde - https://phabricator.wikimedia.org/T412508#11460822 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [15:40:29] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 
10Dzahn) [15:40:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:41:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11460843 (10VRiley-WMF) [15:44:43] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218276|Set wgExternalLinksIgnoreDomains in production (T405005)]] (duration: 08m 20s) [15:44:47] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005 [15:44:53] 06SRE, 06Infrastructure-Foundations, 10Mail: wmf-auto-restart: fails for exim4 - https://phabricator.wikimedia.org/T330660#11460854 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Current Exim packages ship a systemd unit, which resolved this. [15:45:00] (03CR) 10Ladsgroup: "I know it's not hot but since rate limits will be somewhat draconian, you're gonna hit them really easily since you'd be asking for three " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [15:47:07] (03CR) 10Pppery: "This doesn't generate thousands of images. 
The typical process for adding new logos only runs it for one file (hence three thumbnails) at " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [15:47:31] (03PS1) 10Elukey: sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) [15:49:29] 06SRE, 06Infrastructure-Foundations: Provide an option menu when booting via PXE - https://phabricator.wikimedia.org/T191018#11460870 (10jhathaway) p:05Medium→03Low [15:49:30] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11460869 (10LSobanski) @Clement_Goubert is this still needed? [15:50:25] 10SRE-tools, 06Infrastructure-Foundations: Create an offline cookbook to take care of additional offline steps - https://phabricator.wikimedia.org/T335431#11460871 (10LSobanski) p:05Medium→03Low [15:51:27] (03CR) 10Hnowlan: Revert "svg: refuse to generate SVGs larger than a particular size" (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1218284 (https://phabricator.wikimedia.org/T411076) (owner: 10Muehlenhoff) [15:51:41] (03PS1) 10Btullis: Add a spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218318 (https://phabricator.wikimedia.org/T406833) [15:51:50] (03CR) 10CI reject: [V:04-1] Add a spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218318 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:51:52] (03PS1) 10Jakob: Use relative path for "latest" symlinks [dumps] - 10https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) [15:53:06] (03CR) 10Muehlenhoff: [C:03+2] offboard-user: Remove some outdated privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044 
(owner: 10Aklapper) [15:55:17] !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@863e5c2] (releasing): Update git-client plugin # T412694 [15:56:00] (03PS2) 10Btullis: Add a spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218318 (https://phabricator.wikimedia.org/T406833) [15:56:12] !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@863e5c2] (releasing): Update git-client plugin # T412694 (duration: 01m 19s) [15:56:22] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 2 others: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11460913 (10elukey) @Jclark-ctr @VRiley-WMF I filed a patch to test, please let me know when you have a new wikikube worker configured in netbox and re... [15:57:27] (03CR) 10Elukey: "Haven't tested it yet, John and Valerie will give me some new hosts during the next days but lemme know if the logic looks ok or not. I ke" [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: 10Elukey) [15:59:29] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 2 others: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11460922 (10VRiley-WMF) @elukey Will do, I'm working on them now. 
[16:06:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:07:01] !incidents [16:07:01] 7194 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad) [16:07:06] !ack 7194 [16:07:06] 7194 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013} xe-3/2/1 gnmi eqiad) [16:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:11:44] (03CR) 10Btullis: [C:03+2] Add a spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218318 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:12:23] (03PS3) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [16:13:10] (03CR) 10CI reject: [V:04-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [16:13:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:13:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86619 and previous config saved to /var/cache/conftool/dbconfig/20251215-161335-marostegui.json [16:13:41] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:13:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:13:49] (03Merged) 10jenkins-bot: Add a spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218318 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:15:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11461007 (10VRiley-WMF) We currently are out of space in row A for 8 - 10 Gig connections due to power. I have notified @Clement_Goubert to see if we might be able to move t... 
[16:16:37] (03PS4) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [16:17:27] (03CR) 10CI reject: [V:04-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [16:17:52] (03PS5) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [16:17:56] (03PS2) 10Alexandros Kosiaris: Update fc-list to point to fc-list Tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) [16:18:01] (03CR) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [16:21:47] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733 (10cmooney) 03NEW p:05Triage→03Medium [16:21:54] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-f4-codfw:mgmt0 () - https://phabricator.wikimedia.org/T412153#11461052 (10cmooney) [16:21:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: 10Alexandros Kosiaris) [16:22:00] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - 
https://phabricator.wikimedia.org/T412733#11461053 (10cmooney) [16:23:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:23:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:23:48] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-e4-codfw:mgmt0 () - https://phabricator.wikimedia.org/T412152#11461063 (10cmooney) [16:23:49] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11461064 (10cmooney) [16:23:53] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-e2-codfw:mgmt0 () - https://phabricator.wikimedia.org/T412155#11461065 (10cmooney) [16:23:55] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11461066 (10cmooney) [16:24:00] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-f2-codfw:mgmt0 () - https://phabricator.wikimedia.org/T412154#11461067 (10cmooney) [16:24:01] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11461068 (10cmooney) [16:24:09] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface lsw1-e5-codfw:mgmt0 () - https://phabricator.wikimedia.org/T412156#11461069 (10cmooney) [16:24:16] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11461070 (10cmooney) [16:28:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P86620 and previous config saved to /var/cache/conftool/dbconfig/20251215-162844-marostegui.json [16:30:04] jan_drewniak: OwO what's this, a deployment window?? 
Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1630). nyaa~ [16:36:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:43:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P86621 and previous config saved to /var/cache/conftool/dbconfig/20251215-164353-marostegui.json [16:44:42] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm [16:50:27] (03CR) 10Dzahn: [C:03+2] ncredir: add redirects for www.wikipedia25.(org|com) [puppet] - 10https://gerrit.wikimedia.org/r/1217810 (https://phabricator.wikimedia.org/T408168) (owner: 10Dzahn) [16:51:07] (03PS1) 10Btullis: Remove unnecessary prefix from the configmaps and secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218328 (https://phabricator.wikimedia.org/T406833) [16:51:20] (03PS2) 10Btullis: Remove unnecessary prefix from the configmaps and secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218328 (https://phabricator.wikimedia.org/T406833) [16:54:22] (03CR) 10Btullis: [C:03+2] Remove unnecessary prefix from the configmaps and secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218328 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:56:45] (03Merged) 10jenkins-bot: Remove unnecessary prefix from the configmaps and secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218328 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:56:50] 06SRE, 10MW-on-K8s, 06serviceops: 
Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11461202 (10dancy) Here are pretrain image build and push times for the last several days: Unless otherwise noted, full l10n rebuild occurred for each of these im... [16:59:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86622 and previous config saved to /var/cache/conftool/dbconfig/20251215-165901-marostegui.json [16:59:08] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:59:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:59:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:59:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86623 and previous config saved to /var/cache/conftool/dbconfig/20251215-165926-marostegui.json [16:59:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:59:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:59:56] (03PS1) 10DLynch: mobileSectionSwitch: make sure the config gets adjusted earlier [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218330 (https://phabricator.wikimedia.org/T410803) [17:00:04] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table # T412040-fix-dryrun [17:00:08] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [17:01:05] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry 
fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11461220 (10Urbanecm_WMF) FWIW, the third deployment attempt has succeeded today. But, I still have very little idea about what went wrong and why, so I'm leaving... [17:02:04] (03CR) 10Clément Goubert: [C:03+1] mw-*: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217610 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:02:12] (03CR) 10Clément Goubert: [C:03+1] mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:02:20] (03CR) 10Clément Goubert: [C:03+1] {api,rest}-gateway: Update to Envoy 1.35.7 in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217611 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:02:54] (03PS2) 10Dzahn: trafficserver: add a map for wikipedia25.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) [17:03:24] (03CR) 10Dzahn: [C:03+1] "Jelto, I think we can get this out of the way now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:04:53] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:05:38] (03PS1) 10Andrew Bogott: wikitech-static: associate with an elastic IP stored in AWS [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) [17:06:51] (03CR) 10Dzahn: [C:04-1] "if wikipedia25.org and www.wikipedia25.org are both switched over to k8s in DNS - then we can reduce this change to remove .org but leave " [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:08:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:12:00] Since we're in-between the infrastructure and Wikidata windows, any objection to me quickly deploying https://gerrit.wikimedia.org/r/1218330 to fix an issue with an a/b test? 
[17:12:23] (03PS2) 10Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) [17:12:40] (03CR) 10CI reject: [V:04-1] ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:12:52] (03CR) 10Muehlenhoff: [C:03+2] Create a new access group for access to Jumbo Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/1215157 (https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff) [17:14:25] (03PS1) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 [17:14:45] (03PS1) 10Btullis: Increase the resources available to the spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218336 (https://phabricator.wikimedia.org/T406833) [17:14:56] (03CR) 10CI reject: [V:04-1] icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (owner: 10Tiziano Fogli) [17:14:56] (03CR) 10CI reject: [V:04-1] Increase the resources available to the spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218336 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [17:15:01] (03PS2) 10Btullis: Increase the resources available to the spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218336 (https://phabricator.wikimedia.org/T406833) [17:16:00] (03PS1) 10Muehlenhoff: Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 [17:16:42] (03CR) 10CI reject: [V:04-1] Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 (owner: 10Muehlenhoff) [17:16:54] (03PS2) 10Muehlenhoff: Add javiermonton to kafka-jumbo-access group [puppet] - 
10https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774) [17:17:51] (03PS2) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) [17:18:47] No objections, so I am going for it. [17:19:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218330 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [17:20:33] (03PS1) 10Alexandros Kosiaris: multi-dc: Switch www.wikifunctions to Single-DC [puppet] - 10https://gerrit.wikimedia.org/r/1218338 (https://phabricator.wikimedia.org/T405461) [17:20:41] (03Merged) 10jenkins-bot: mobileSectionSwitch: make sure the config gets adjusted earlier [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218330 (https://phabricator.wikimedia.org/T410803) (owner: 10DLynch) [17:21:02] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1218330|mobileSectionSwitch: make sure the config gets adjusted earlier (T410803)]] [17:21:06] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [17:22:25] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11461316 (10Clement_Goubert) Not strictly, but it would be nice to have for peace of mind. This may be... 
[17:22:56] (03CR) 10Btullis: [C:03+2] Increase the resources available to the spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218336 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [17:22:57] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1218330|mobileSectionSwitch: make sure the config gets adjusted earlier (T410803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:23:49] !log kemayo@deploy2002 kemayo: Continuing with sync [17:25:28] (03Merged) 10jenkins-bot: Increase the resources available to the spark-toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218336 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [17:25:30] (03CR) 10Elukey: ml-build: add docker-pkg (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [17:26:27] (03CR) 10Elukey: ml-build: add docker-pkg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [17:27:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [17:27:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [17:27:48] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218330|mobileSectionSwitch: make sure the config gets adjusted earlier (T410803)]] (duration: 06m 46s) [17:27:53] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803 [17:28:32] (03CR) 10Elukey: ml-build: add docker-pkg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [17:29:31] (03PS9) 10Dzahn: unlink wikipedia25.[org|com] from ncredir, point to k8s-ingress [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) [17:29:49] (03PS10) 10Dzahn: unlink wikipedia25.org from ncredir, 
point to k8s-ingress [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) [17:31:23] (03CR) 10Dzahn: [V:03+1 C:03+2] releases: use stunnel with rsync from deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [17:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:34:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11461353 (10Clement_Goubert) Thanks for the heads up @VRiley-WMF. I've tried to plan out how the balance of servers per row would look like after the refreshes in [[ https:... [17:36:47] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:38:10] (03CR) 10Tiziano Fogli: "Please merge after:" [dns] - 10https://gerrit.wikimedia.org/r/1218333 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [17:52:26] (03PS1) 10Dreamy Jazz: Remove definition of wgGlobalBlockingEnableAutoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) [17:52:46] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:53:09] jouncebot: nowandnext [17:53:10] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [17:53:10] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1800) [17:53:10] In 0 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1800) [17:53:21] Anyone using the next window? [17:56:21] Actually I need to go, so disregard that question [17:59:53] (03PS1) 10Dreamy Jazz: Show global autoblocks in the globalblocks list API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1800) [18:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T1800). [18:01:40] o/ I'll ship some MW envoy upgrades in the infra window [18:04:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11461432 (10Jhancock.wm) @cmooney i checked what we had in stock. We don't have that SFP. and could use a new 5 meter fiber for this task. 
[18:04:25] (03CR) 10RLazarus: [C:03+2] mw-*: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217610 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [18:04:28] (03PS3) 10Andrew Bogott: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [18:04:29] (03PS1) 10Andrew Bogott: icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) [18:06:17] (03Merged) 10jenkins-bot: mw-*: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217610 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [18:06:29] (03CR) 10Urbanecm: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:06:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:07:05] (03CR) 10Urbanecm: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:08:38] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1217610 T410975 [18:08:42] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [18:10:56] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases1003.eqiad.wmnet with reason: T289858 [18:11:03] T289858: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 [18:11:17] (03PS1) 10Cathal Mooney: Swift-proxy: set DSCP on 
outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 [18:12:08] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [18:12:21] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases2003.codfw.wmnet with reason: T289858 [18:12:23] (03PS2) 10Andrew Bogott: icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) [18:13:19] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1217610 T410975 (duration: 05m 20s) [18:14:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:15:22] (03PS2) 10Cathal Mooney: Swift-proxy: set DSCP on outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 [18:16:46] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [18:17:04] (03CR) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:19:31] done with the infra window -- I was thinking about deploying https://gerrit.wikimedia.org/r/1217609 as well but I'd rather wait until all the subject matter experts aren't on PTO [18:20:45] (03PS3) 10Andrew Bogott: icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) [18:20:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:21:24] (03CR) 10CI reject: 
[V:04-1] icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:21:56] (03PS3) 10Cathal Mooney: Swift-proxy: set DSCP on outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 [18:22:46] (03PS1) 10Dzahn: deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) [18:23:16] (03CR) 10CI reject: [V:04-1] deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [18:23:24] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [18:23:25] (03PS2) 10Dzahn: deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) [18:23:54] (03CR) 10CI reject: [V:04-1] deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [18:24:48] (03PS4) 10Andrew Bogott: icinga: remove wikitech-static mediawiki version check [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) [18:24:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218344 (https://phabricator.wikimedia.org/T376400) (owner: 10Andrew Bogott) [18:25:01] (03PS3) 10Dzahn: deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) [18:31:32] (03PS6) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 
(https://phabricator.wikimedia.org/T405169) [18:32:19] (03CR) 10CI reject: [V:04-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:33:10] (03CR) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:34:48] (03PS7) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [18:47:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11461551 (10MoritzMuehlenhoff) [18:50:54] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [18:51:08] (03CR) 10CDanis: [C:03+1] Swift-proxy: set DSCP on outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [18:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:06:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 315019008 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:08:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51856 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:35:14] (03PS1) 10Jasmine: wmnet: add DNS records for new stacked control planes wikikube-ctrl200[45] [dns] - 10https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) [19:52:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [19:52:44] Deployment mw-jobrunner.codfw.canary in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.canary - ... 
[19:52:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:55:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 191657168 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:56:24] (03PS2) 10Jasmine: Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) [19:58:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 129344 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:05:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:07:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [20:07:44] Deployment mw-jobrunner.codfw.canary in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.canary - ... 
[20:07:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [20:08:16] (03PS3) 10Aqu: Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) [20:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:10:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86625 and previous config saved to /var/cache/conftool/dbconfig/20251215-201051-marostegui.json [20:10:57] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:10:58] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:22:09] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: PNG image served with image/svg+xml content type - https://phabricator.wikimedia.org/T225586#11461757 (10TheDJ) [20:23:39] (03PS3) 10Jasmine: Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) [20:26:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P86626 and previous config saved to /var/cache/conftool/dbconfig/20251215-202559-marostegui.json [20:41:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P86627 and previous config saved to /var/cache/conftool/dbconfig/20251215-204108-marostegui.json [20:56:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2150 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86628 and previous config saved to /var/cache/conftool/dbconfig/20251215-205616-marostegui.json [20:56:24] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:56:25] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:56:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:56:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86629 and previous config saved to /var/cache/conftool/dbconfig/20251215-205643-marostegui.json [20:59:58] 10SRE-Access-Requests: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755 (10CDobbins) 03NEW [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T2100) [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:44] 10SRE-Access-Requests: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11462070 (10CDobbins) [21:02:06] (03PS1) 10CDobbins: admin: add fido-based ssh access for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) [21:02:34] 10SRE-Access-Requests, 13Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11462075 (10CDobbins) [21:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:16] (03CR) 10Dzahn: [V:03+1 C:03+2] "also needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1218348" [puppet] - 10https://gerrit.wikimedia.org/r/1217594 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [21:19:18] (03CR) 10Dzahn: [C:03+2] deployment::server: allow releases hosts encrypted rsync [puppet] - 10https://gerrit.wikimedia.org/r/1218348 (https://phabricator.wikimedia.org/T289858) (owner: 10Dzahn) [21:33:01] (03Abandoned) 10Herron: pyrra: wikifunctions: add eqiad/codfw site variants [puppet] - 10https://gerrit.wikimedia.org/r/1211177 (https://phabricator.wikimedia.org/T407503) (owner: 10Herron) [21:37:02] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:44:29] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases2003.codfw.wmnet with reason: T289858 [21:44:33] T289858: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 [21:44:52] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on releases1003.eqiad.wmnet with reason: T289858 [21:52:54] (03PS1) 10Bking: opensearch-on-k8s: Update opensearch image from 2.7.0 -> 2.19.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218363 (https://phabricator.wikimedia.org/T412447) [22:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251215T2200). [22:07:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86630 and previous config saved to /var/cache/conftool/dbconfig/20251215-220751-marostegui.json [22:07:58] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:07:58] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:14:20] (03PS1) 10Pppery: Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1218364 [22:15:15] (03PS2) 10Bking: opensearch-on-k8s: Update opensearch image from 2.7.0 -> 2.19.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218363 (https://phabricator.wikimedia.org/T412447) [22:18:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) (owner: 10CDobbins) [22:21:53] urbanecm: RE "curious, what was the UPDATE for?" 
– Myself and other MwEng folks were investigating T405461 at an offsite in Lisbon last week, and reproducing it step by step. Purge and edits do a lot so we wanted to isolate it to just this one step to see how it impacts ParserCache replication and invalidation. [22:21:54] T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem - https://phabricator.wikimedia.org/T405461 [22:23:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P86631 and previous config saved to /var/cache/conftool/dbconfig/20251215-222300-marostegui.json [22:23:19] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases1003.eqiad.wmnet with reason: T289858 [22:23:23] T289858: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 [22:23:32] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: T289858 [22:26:47] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:33:22] (03PS1) 10MusikAnimal: Use CodeMirror instead of CodeEditor for beta feature users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218368 (https://phabricator.wikimedia.org/T373711) [22:35:48] (03CR) 10Bking: [C:03+2] "self-merging, as I deployed successfully with the new image in the `opensearch-test` namespace on dse-k8s-eqiad." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1218363 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[22:38:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P86632 and previous config saved to /var/cache/conftool/dbconfig/20251215-223808-marostegui.json
[22:50:27] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218368 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal)
[22:51:10] (Merged) jenkins-bot: Use CodeMirror instead of CodeEditor for beta feature users [mediawiki-config] - https://gerrit.wikimedia.org/r/1218368 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal)
[22:51:33] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1218368|Use CodeMirror instead of CodeEditor for beta feature users (T373711)]]
[22:51:37] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711
[22:52:52] (PS2) Aaron Schulz: rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807)
[22:53:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86633 and previous config saved to /var/cache/conftool/dbconfig/20251215-225317-marostegui.json
[22:53:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[22:53:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[22:53:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[22:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:56:15] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1218368|Use CodeMirror instead of CodeEditor for beta feature users (T373711)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:59:37] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[23:00:31] (PS1) MusikAnimal: CodeMirrorWikiEditor: add style module to prevent FOUCs [extensions/CodeMirror] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218372
[23:03:50] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218368|Use CodeMirror instead of CodeEditor for beta feature users (T373711)]] (duration: 12m 17s)
[23:03:54] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711
[23:04:43] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218372 (owner: MusikAnimal)
[23:13:07] (PS1) DLynch: mobileSectionSwitch: make sure session ID isn't regenerated each time [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218374 (https://phabricator.wikimedia.org/T410803)
[23:13:47] (Merged) jenkins-bot: CodeMirrorWikiEditor: add style module to prevent FOUCs [extensions/CodeMirror] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218372 (owner: MusikAnimal)
[23:14:09] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1218372|CodeMirrorWikiEditor: add style module to prevent FOUCs]]
[23:15:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 129760144 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:16:15] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1218372|CodeMirrorWikiEditor: add style module to prevent FOUCs]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:17:06] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[23:17:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 82312 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:21:12] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218372|CodeMirrorWikiEditor: add style module to prevent FOUCs]] (duration: 07m 03s)
[23:22:02] musikanimal: Got any more to do, or can I jump in?
[23:22:10] all yours!
[23:22:13] Thanks!
[23:22:26] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218374 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[23:30:16] (Merged) jenkins-bot: mobileSectionSwitch: make sure session ID isn't regenerated each time [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218374 (https://phabricator.wikimedia.org/T410803) (owner: DLynch)
[23:30:38] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1218374|mobileSectionSwitch: make sure session ID isn't regenerated each time (T410803)]]
[23:30:42] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803
[23:32:47] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1218374|mobileSectionSwitch: make sure session ID isn't regenerated each time (T410803)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:33:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:33:55] !log kemayo@deploy2002 kemayo: Continuing with sync
[23:37:55] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218374|mobileSectionSwitch: make sure session ID isn't regenerated each time (T410803)]] (duration: 07m 17s)
[23:38:00] T410803: Create data stream for mobile web section editing dead-end intervention - https://phabricator.wikimedia.org/T410803
[23:40:20] (PS1) Btullis: Update the hostname used by dbt in the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1218379 (https://phabricator.wikimedia.org/T410017)
[23:43:37] (CR) Btullis: [C:+2] Update the hostname used by dbt in the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1218379 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[23:43:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[23:45:20] (Merged) jenkins-bot: Update the hostname used by dbt in the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1218379 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[23:46:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[23:46:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[23:49:49] SRE, Infrastructure-Foundations, netops, Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11462541 (cmooney) This has come up again in terms of the pages we have been getting of late, and we may take some action to chan...
[23:58:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[23:58:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS