[00:08:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1168652
[00:08:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1168652 (owner: TrainBranchBot)
[00:29:20] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1168652 (owner: TrainBranchBot)
[00:34:08] (PS1) DDesouza: Deploy Readers Use Cases Survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1168653 (https://phabricator.wikimedia.org/T398870)
[00:36:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1168653 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza)
[00:46:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/86bd61a94b76cd792513ee60d667527fd01d6ccce2b09ff6b0ed96e1a4c03818/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:06:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:22:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:52:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:38:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:40:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[03:05:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:25:48]
RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[03:40:40] FIRING: [3x] SystemdUnitFailed: cowbuilder_update_buster-amd64.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:34:16] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:35:21] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:38:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:39:09] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:40:21] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:44:50] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:45:17] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:49:33] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:50:25] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:00:40] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10998799 (Marostegui) a:VRiley-WMF→FCeratto-WMF @VRiley-WMF thank you so much, we can take it from here. @FCeratto-WMF the host is accessible, please productionize it: puppet + recloning + po...
[06:03:12] (PS1) Marostegui: dbproxy1022,dbproxy1024: Test db1213 for m1 [puppet] - https://gerrit.wikimedia.org/r/1168668 (https://phabricator.wikimedia.org/T399172)
[06:04:12] (CR) Marostegui: [C:+2] dbproxy1022,dbproxy1024: Test db1213 for m1 [puppet] - https://gerrit.wikimedia.org/r/1168668 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[06:08:10] (PS1) Marostegui: Revert "dbproxy1022,dbproxy1024: Test db1213 for m1" [puppet] - https://gerrit.wikimedia.org/r/1168669
[06:08:39] (CR) Marostegui: "There was a typo on dbproxy1024, but the IP+hostname worked fine on 1022." [puppet] - https://gerrit.wikimedia.org/r/1168669 (owner: Marostegui)
[06:09:42] (CR) Marostegui: [C:+2] Revert "dbproxy1022,dbproxy1024: Test db1213 for m1" [puppet] - https://gerrit.wikimedia.org/r/1168669 (owner: Marostegui)
[06:15:01] (PS1) Marostegui: db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1168670 (https://phabricator.wikimedia.org/T399172)
[06:15:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2232].codfw.wmnet,db[1207,1213,1217].eqiad.wmnet with reason: Primary switchover m1 T399172
[06:15:24] T399172: Switchover m1 master (db1207 -> db1213) - https://phabricator.wikimedia.org/T399172
[06:15:34] (CR) Marostegui: [C:+2] db1213: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1168670 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[06:15:58] (PS1) Muehlenhoff: Remove buster cowbuilder environment [puppet] - https://gerrit.wikimedia.org/r/1168671 (https://phabricator.wikimedia.org/T397209)
[06:16:29] (PS2) Muehlenhoff: Remove buster cowbuilder environment [puppet] - https://gerrit.wikimedia.org/r/1168671 (https://phabricator.wikimedia.org/T397209)
[06:18:43] (PS1) Marostegui: mariadb: Promote db1213 to m1 master [puppet] - https://gerrit.wikimedia.org/r/1168672
(https://phabricator.wikimedia.org/T399172)
[06:18:49] (CR) Muehlenhoff: [C:+2] Remove buster cowbuilder environment [puppet] - https://gerrit.wikimedia.org/r/1168671 (https://phabricator.wikimedia.org/T397209) (owner: Muehlenhoff)
[06:19:26] (CR) Marostegui: mariadb: Change backups host [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[06:19:33] (PS2) Marostegui: mariadb: Change backups host [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172)
[06:19:38] (CR) Marostegui: [C:+2] mariadb: Promote db1213 to m1 master [puppet] - https://gerrit.wikimedia.org/r/1168672 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[06:23:44] !log Failover m1 from db1207 to db1213 - T399172
[06:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:48] T399172: Switchover m1 master (db1207 -> db1213) - https://phabricator.wikimedia.org/T399172
[06:26:00] (CR) Marostegui: [C:+2] mariadb: Change backups host [puppet] - https://gerrit.wikimedia.org/r/1167833 (https://phabricator.wikimedia.org/T399172) (owner: Marostegui)
[06:28:07] (PS1) Marostegui: db1207: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168673 (https://phabricator.wikimedia.org/T399060)
[06:28:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[06:29:45] (CR) Marostegui: [C:+2] db1207: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168673 (https://phabricator.wikimedia.org/T399060) (owner: Marostegui)
[06:30:25] FIRING: [5x] SystemdUnitFailed: cowbuilder_update_buster-amd64.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:34:18] (PS1) Muehlenhoff: docker::baseimages: Stop building new buster base images [puppet] - https://gerrit.wikimedia.org/r/1168675 (https://phabricator.wikimedia.org/T397209)
[06:35:55] FIRING: [5x] SystemdUnitFailed: cowbuilder_update_buster-amd64.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:31] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1168179 (owner: Muehlenhoff)
[06:39:07] (PS1) Marostegui: mariadb: Move db1207 to s5 [puppet] - https://gerrit.wikimedia.org/r/1168676 (https://phabricator.wikimedia.org/T399430)
[06:39:55] (PS2) Marostegui: mariadb: Move db1207 to s5 [puppet] - https://gerrit.wikimedia.org/r/1168676 (https://phabricator.wikimedia.org/T399430)
[06:40:42] (CR) Marostegui: [C:+2] mariadb: Move db1207 to s5 [puppet] - https://gerrit.wikimedia.org/r/1168676 (https://phabricator.wikimedia.org/T399430) (owner: Marostegui)
[06:41:27] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1200.eqiad.wmnet onto db1207.eqiad.wmnet
[06:41:30] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1200 - Depool db1200.eqiad.wmnet to then clone it to db1207.eqiad.wmnet - marostegui@cumin1002
[06:41:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1200 - Depool db1200.eqiad.wmnet to then clone it to db1207.eqiad.wmnet - marostegui@cumin1002
[06:48:52] (PS1) Muehlenhoff: Remove access for dalezhou [puppet] -
https://gerrit.wikimedia.org/r/1168749
[06:51:00] (CR) Muehlenhoff: [C:+2] Remove access for dalezhou [puppet] - https://gerrit.wikimedia.org/r/1168749 (owner: Muehlenhoff)
[06:52:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance
[06:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P78937 and previous config saved to /var/cache/conftool/dbconfig/20250714-065240-marostegui.json
[06:52:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:54:07] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dale Zhou out of all services on: 2395 hosts
[06:55:02] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1168404 (https://phabricator.wikimedia.org/T398686) (owner: Cathal Mooney)
[06:58:14] (PS1) Abijeet Patro: CX: Remove unused config related to database and cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513)
[07:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:00] (CR) Elukey: [C:+1] docker::baseimages: Stop building new buster base images [puppet] - https://gerrit.wikimedia.org/r/1168675 (https://phabricator.wikimedia.org/T397209) (owner: Muehlenhoff)
[07:02:05] (PS2) Abijeet Patro: CX: Remove unused config related to database and cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513)
[07:03:25] (PS2) KartikMistry: machinetranslationt: Use s3 model storage for production [deployment-charts] - https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491)
[07:04:14] (CR) Nikerabbit: [C:+1] CX: Remove unused config related to database and cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[07:05:27] (CR) Elukey: [C:+1] "LGTM! Please make sure that you don't have overrides in staging since it may become difficult in the future to find them (namely, unless y" [deployment-charts] - https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[07:08:32] (PS1) Muehlenhoff: Convert Nuria's access to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1168835 (https://phabricator.wikimedia.org/T397850)
[07:08:57] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[07:11:01] SRE, decommission-hardware, Infrastructure-Foundations: decommission sretest1001 - https://phabricator.wikimedia.org/T399435 (MoritzMuehlenhoff) NEW
[07:12:47] (PS1) Muehlenhoff: Update test host for unpriv Cumin [puppet] - https://gerrit.wikimedia.org/r/1168837 (https://phabricator.wikimedia.org/T399435)
[07:14:19] (PS1) Marostegui: instances.yaml: Add db1207 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1168838 (https://phabricator.wikimedia.org/T399430)
[07:14:56] (PS2) Muehlenhoff: No longer use mirrors.debian.org on Buster hosts
[puppet] - https://gerrit.wikimedia.org/r/1160171 (https://phabricator.wikimedia.org/T397209)
[07:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:15:29] (CR) Marostegui: [C:+2] instances.yaml: Add db1207 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1168838 (https://phabricator.wikimedia.org/T399430) (owner: Marostegui)
[07:18:11] (PS3) Muehlenhoff: No longer use mirrors.debian.org on Buster hosts [puppet] - https://gerrit.wikimedia.org/r/1160171 (https://phabricator.wikimedia.org/T397209)
[07:19:09] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[07:20:36] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[07:21:25] (PS1) Jelto: miscweb: remove design-style-guide [deployment-charts] - https://gerrit.wikimedia.org/r/1168839 (https://phabricator.wikimedia.org/T360362)
[07:21:30] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[07:22:05] marostegui@cumin1002 clone (PID 141771) is awaiting input
[07:23:53] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160171 (https://phabricator.wikimedia.org/T397209) (owner: Muehlenhoff)
[07:26:39] (CR) Vgutierrez: [C:-1] cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[07:27:12] (PS1) Elukey: admin_ng: bump memory quota for kartotherian on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1168840
[07:30:11] (CR) Vgutierrez: [C:-1] cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[07:30:25]
FIRING: [4x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:30:55] (CR) Muehlenhoff: [C:+2] Convert Nuria's access to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1168835 (https://phabricator.wikimedia.org/T397850) (owner: Muehlenhoff)
[07:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P78938 and previous config saved to /var/cache/conftool/dbconfig/20250714-073354-marostegui.json
[07:33:59] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:34:23] (PS2) Fabfur: cache::haproxy: add x_analytics log variable to http frontend too [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167)
[07:34:28] (CR) Fabfur: cache::haproxy: add x_analytics log variable to http frontend too (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[07:38:22] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1168837 (https://phabricator.wikimedia.org/T399435) (owner: Muehlenhoff)
[07:38:33] (CR) Vgutierrez: cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[07:42:51] (CR) Muehlenhoff: [C:+2] Update test host for unpriv Cumin [puppet] - https://gerrit.wikimedia.org/r/1168837 (https://phabricator.wikimedia.org/T399435) (owner: Muehlenhoff)
[07:44:35] (PS1) Muehlenhoff: Enable sretest1002 for kerberized SSH access [puppet] - https://gerrit.wikimedia.org/r/1169029
[07:48:56] (CR)
Vgutierrez: [C:-1] cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[07:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P78939 and previous config saved to /var/cache/conftool/dbconfig/20250714-074902-marostegui.json
[07:52:08] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436 (OKryva-WMF) NEW
[07:58:53] (CR) Vgutierrez: [C:-1] cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[08:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P78940 and previous config saved to /var/cache/conftool/dbconfig/20250714-080409-marostegui.json
[08:04:24] (PS2) Volans: I/F: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167886
[08:04:40] (CR) Volans: "I had forgot to add `argument_task_required` to a couple of them, sorry." [cookbooks] - https://gerrit.wikimedia.org/r/1167886 (owner: Volans)
[08:05:22] (CR) Elukey: [V:+1 C:+2] pyrra: remove multi-dc for istio-based SLOs [puppet] - https://gerrit.wikimedia.org/r/1166076 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[08:05:26] (CR) Volans: "Makes sense to me, but I'll leave to o11y to decide what to do."
[cookbooks] - https://gerrit.wikimedia.org/r/1167887 (owner: Volans)
[08:09:40] (PS2) Volans: Collab: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167889
[08:09:47] (CR) Volans: "addressed comment" [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[08:10:54] (PS2) Filippo Giunchedi: o11y: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167887 (owner: Volans)
[08:11:22] (CR) Filippo Giunchedi: [C:+1] "I went ahead and added raises=False, LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1167887 (owner: Volans)
[08:13:44] SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10999059 (Nikerabbit)
[08:16:44] (CR) Muehlenhoff: [C:+1] "LGTM also for the PS1->PS2 changes" [cookbooks] - https://gerrit.wikimedia.org/r/1167886 (owner: Volans)
[08:17:35] (CR) Muehlenhoff: [C:+2] Enable sretest1002 for kerberized SSH access [puppet] - https://gerrit.wikimedia.org/r/1169029 (owner: Muehlenhoff)
[08:19:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P78941 and previous config saved to /var/cache/conftool/dbconfig/20250714-081917-marostegui.json
[08:19:21] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:19:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P78942 and previous config saved to /var/cache/conftool/dbconfig/20250714-081939-marostegui.json
[08:25:50] !log volans@cumin2002 START - Cookbook sre.network.debug for Netbox circuit ID 93
[08:26:00] !log volans@cumin2002 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 93
[08:26:24] (CR) Volans: [C:+2] I/F: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167886 (owner: Volans)
[08:27:30] (PS1) Elukey: pyrra: fix latency and ratio ensures for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1169033
[08:27:48] SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10999110 (Nikerabbit)
[08:28:07] (CR) Volans: "Yes that's left to the cookbook owners to decide what to do, we didn't want to make the assumption to always skip or always fail in case a" [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[08:28:09] jouncebot: nowandnext
[08:28:09] No deployments scheduled for the next 1 hour(s) and 31 minute(s)
[08:28:09] In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1000)
[08:28:18] (CR) Kosta Harlan: WIP: Prep hCaptcha config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy)
[08:30:13] (CR) Elukey: [C:+2] pyrra: fix latency and ratio ensures for Istio SLOs [puppet] - https://gerrit.wikimedia.org/r/1169033 (owner: Elukey)
[08:30:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1259.eqiad.wmnet with reason: New host setup
[08:30:50] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10999124 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94d69bcb-2e00-40fe-be67-efbd8d2cf1e5) set by fceratto@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services
with rea...
[08:32:05] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts sretest1001.eqiad.wmnet
[08:32:30] (CR) Volans: "Do you have any feedback on this?" [software/debmonitor] - https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696) (owner: Volans)
[08:33:15] (Merged) jenkins-bot: I/F: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167886 (owner: Volans)
[08:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:35:41] (PS5) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[08:37:03] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6246/co" [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[08:39:12] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:39:44] (PS6) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[08:40:00] (PS1) Marostegui: instances.yaml: Add es1047 [puppet] - https://gerrit.wikimedia.org/r/1169037 (https://phabricator.wikimedia.org/T395771)
[08:40:12] (CR) Filippo Giunchedi: [C:+2] o11y: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167887 (owner: Volans)
[08:40:28] (PS7) Reedy: WIP: Prep hCaptcha config [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148)
[08:40:34] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6247/co" [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[08:40:41] (CR) Reedy: WIP: Prep hCaptcha config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy)
[08:42:13] (CR) Vgutierrez: [C:-1] cache::haproxy: add x_analytics log variable to http frontend too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: Fabfur)
[08:42:36] (PS1) Marostegui: Prepare db1259 for production, remove db1246 [puppet] - https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: Federico Ceratto)
[08:42:36] (CR) Marostegui: [C:-1] "Leave db1246.yaml untouched for now. We need to see how to start a decommissioning process for a host that has already being decommissione" [puppet] - https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: Federico Ceratto)
[08:43:07] (CR) Marostegui: [C:+2] instances.yaml: Add es1047 [puppet] - https://gerrit.wikimedia.org/r/1169037 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[08:44:42] jmm@cumin1003 decommission (PID 1654808) is awaiting input
[08:45:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1047 to es6 depooled T395771', diff saved to https://phabricator.wikimedia.org/P78943 and previous config saved to /var/cache/conftool/dbconfig/20250714-084506-marostegui.json
[08:45:10] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[08:46:14] (PS7) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[08:47:10] (CR) Elukey: [V:+1] "PCC SUCCESS
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6248/co" [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[08:48:29] (PS8) Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534)
[08:49:26] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6249/co" [puppet] - https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: Elukey)
[08:49:59] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:50:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:50:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:50:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest1001.eqiad.wmnet
[08:50:23] (CR) Lucas Werkmeister (WMDE): [C:+1] Add phan and use it to detect duplicated array keys (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy)
[08:50:24] (PS2) Federico Ceratto: Prepare db1259 for production, remove db1246 [puppet] - https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296)
[08:50:27] SRE, decommission-hardware, Infrastructure-Foundations: decommission sretest1001 - https://phabricator.wikimedia.org/T399435#10999195 (ops-monitoring-bot)
cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `sretest1001.eqiad.wmnet` - sretest1001.eqiad.wmnet (**PASS**) - Downtimed... [08:50:45] (03PS1) 10Marostegui: es1047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169039 (https://phabricator.wikimedia.org/T395771) [08:51:29] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1200 gradually with 4 steps - Pool db1200.eqiad.wmnet in after cloning [08:52:07] (03CR) 10Marostegui: [C:03+2] es1047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169039 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [08:52:29] (03PS1) 10Muehlenhoff: Remove dummy keytab for sretest1001 (decommed) [labs/private] - 10https://gerrit.wikimedia.org/r/1169040 [08:53:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:53:51] (03CR) 10Marostegui: [C:04-1] "The change in site.pp is missing. You'd need to remove it from the insetup role and add it to s2 hosts with the proper role." [puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [08:53:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:54:08] (03PS9) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [08:54:12] (03CR) 10Btullis: [C:03+2] Increase the CPU and memory limits for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168171 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [08:54:14] (03PS1) 10Muehlenhoff: Remove sretest1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1169041 (https://phabricator.wikimedia.org/T399435) [08:54:33] (03CR) 10CI reject: [V:04-1] pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [08:54:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission sretest1001 - https://phabricator.wikimedia.org/T399435#10999205 (10MoritzMuehlenhoff) [08:54:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 1%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78945 and previous config saved to /var/cache/conftool/dbconfig/20250714-085457-root.json [08:54:59] (03CR) 10Muehlenhoff: [C:03+2] Remove sretest1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1169041 (https://phabricator.wikimedia.org/T399435) (owner: 10Muehlenhoff) [08:55:01] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [08:55:53] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2243.codfw.wmnet with reason: Maintenance [08:55:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2243 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78946 and previous config saved to /var/cache/conftool/dbconfig/20250714-085556-marostegui.json [08:56:07] (03Merged) 10jenkins-bot: Increase the 
CPU and memory limits for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168171 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [08:56:33] (03PS1) 10Marostegui: db2243: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169043 (https://phabricator.wikimedia.org/T399298) [08:58:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [09:00:43] (03CR) 10Marostegui: [C:03+2] db2243: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169043 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [09:00:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P78947 and previous config saved to /var/cache/conftool/dbconfig/20250714-090057-marostegui.json [09:01:02] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:01:19] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2243.codfw.wmnet with reason: Maintenance [09:04:23] (03PS10) 10Elukey: pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) [09:04:44] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2243.codfw.wmnet with reason: Maintenance [09:05:18] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6251/console" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:05:40] (03PS3) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) [09:05:54] (03CR) 10Elukey: 
[V:03+1] "Finally reached a no-op!" [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:06:45] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bookworm [09:07:10] (03CR) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [09:08:21] (03PS4) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) [09:08:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [09:09:09] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: refactor the filesystem class to be more readable [puppet] - 10https://gerrit.wikimedia.org/r/1166135 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:10:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 5%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78949 and previous config saved to /var/cache/conftool/dbconfig/20250714-091003-root.json [09:10:05] (03Abandoned) 10Elukey: pyrra: remove multi-dc for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1166149 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [09:10:07] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [09:11:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78950 and previous config saved to /var/cache/conftool/dbconfig/20250714-091121-root.json [09:12:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [09:12:54] (03PS3) 10Elukey: pyrra: add tonecheck Pyrra config 
[puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) [09:13:10] (03PS5) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) [09:13:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [09:15:30] (03CR) 10Bartosz Wójtowicz: "Thank you for the review @ltoscano@wikimedia.org <3" [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [09:15:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 9 hosts with reason: Maintenance [09:16:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P78951 and previous config saved to /var/cache/conftool/dbconfig/20250714-091605-marostegui.json [09:16:15] !log Stop mariadb on db1154 for migration, there will be lag on s1, s3, s5, s8 and x3 T398928 [09:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] T398928: Migrate s5 to MariaDB 10.11 - https://phabricator.wikimedia.org/T398928 [09:17:18] (03PS1) 10Marostegui: db1154: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169044 (https://phabricator.wikimedia.org/T398928) [09:18:01] (03PS4) 10Elukey: pyrra: add tonecheck Pyrra config [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) [09:18:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:19:17] PROBLEM - MariaDB Replica IO: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state 
Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1154.eqiad.wmnet:3311 - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:20:08] (03PS1) 10Btullis: Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169045 (https://phabricator.wikimedia.org/T396617) [09:20:33] PROBLEM - MariaDB Replica IO: s3 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1154.eqiad.wmnet:3313 - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:21:51] PROBLEM - MariaDB Replica IO: s5 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:22:13] PROBLEM - MariaDB Replica IO: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:23:19] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6253/co" [puppet] - 10https://gerrit.wikimedia.org/r/1165548 
(https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [09:25:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 10%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78953 and previous config saved to /var/cache/conftool/dbconfig/20250714-092508-root.json [09:25:13] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [09:25:35] !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [09:26:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78954 and previous config saved to /var/cache/conftool/dbconfig/20250714-092626-root.json [09:26:37] an-redacteddb1001 are expected [09:26:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 10 hosts with reason: Maintenance [09:26:58] (03CR) 10Marostegui: [C:03+2] db1154: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169044 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [09:27:01] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: add tonecheck Pyrra config [puppet] - 10https://gerrit.wikimedia.org/r/1165548 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [09:28:04] (03CR) 10Btullis: [C:03+2] Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169045 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [09:28:50] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [09:28:50] (03CR) 10Btullis: [V:03+1 C:03+2] Enable greater timeouts and rewriting for the spark-history service [puppet] - 10https://gerrit.wikimedia.org/r/1168165 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [09:29:51] 
btullis: ok to merge? [09:30:48] Yes please. [09:31:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P78955 and previous config saved to /var/cache/conftool/dbconfig/20250714-093114-marostegui.json [09:35:17] RECOVERY - MariaDB Replica IO: s1 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:35:27] (03Merged) 10jenkins-bot: Tweak the limitrange for the dse-k8s-eqiad/spark-history namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169045 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [09:36:33] RECOVERY - MariaDB Replica IO: s3 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:36:51] RECOVERY - MariaDB Replica IO: s5 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:36:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1200 gradually with 4 steps - Pool db1200.eqiad.wmnet in after cloning [09:36:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1200.eqiad.wmnet onto db1207.eqiad.wmnet [09:37:13] RECOVERY - MariaDB Replica IO: s8 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:38:49] (03PS1) 10Marostegui: db1161: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169046 (https://phabricator.wikimedia.org/T398928) [09:39:20] (03CR) 10Marostegui: [C:03+2] db1161: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169046 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [09:39:39] jouncebot: nowandnext [09:39:39] No 
deployments scheduled for the next 0 hour(s) and 20 minute(s) [09:39:39] In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1000) [09:40:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 25%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78957 and previous config saved to /var/cache/conftool/dbconfig/20250714-094014-root.json [09:40:19] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [09:40:30] (03PS3) 10Federico Ceratto: Prepare db1259 for production, remove db1246 [puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) [09:40:44] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [09:40:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1161 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78958 and previous config saved to /var/cache/conftool/dbconfig/20250714-094048-marostegui.json [09:41:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78959 and previous config saved to /var/cache/conftool/dbconfig/20250714-094132-root.json [09:42:02] (03CR) 10Marostegui: [C:04-1] "See comments on the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [09:45:38] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1169049 (https://phabricator.wikimedia.org/T399446) [09:45:43] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1169050 (https://phabricator.wikimedia.org/T399446) [09:46:07] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bookworm [09:46:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P78960 and previous config saved to /var/cache/conftool/dbconfig/20250714-094621-marostegui.json [09:46:30] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:46:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [09:46:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P78961 and previous config saved to /var/cache/conftool/dbconfig/20250714-094646-marostegui.json [09:46:46] (03PS1) 10Samtar: IS: Set wgTemplateDataEnableCategoryBrowser default enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169051 (https://phabricator.wikimedia.org/T391064) [09:48:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:48:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78962 and previous config saved to /var/cache/conftool/dbconfig/20250714-094901-root.json [09:49:02] 06SRE, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447 (10MoritzMuehlenhoff) 03NEW [09:49:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [09:49:53] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [09:54:38] (03PS4) 10Federico Ceratto: Prepare db1259 for production [puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) [09:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:55:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 40%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78963 and previous config saved to /var/cache/conftool/dbconfig/20250714-095520-root.json [09:55:27] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [09:56:22] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10999608 (10MoritzMuehlenhoff) I've edited https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests to list Sowmya as well. 
[09:56:40] (03CR) 10Marostegui: [C:03+1] Prepare db1259 for production [puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [09:56:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db1259 T393296', diff saved to https://phabricator.wikimedia.org/P78964 and previous config saved to /var/cache/conftool/dbconfig/20250714-095649-fceratto.json [09:56:53] T393296: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296 [09:56:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78965 and previous config saved to /var/cache/conftool/dbconfig/20250714-095658-root.json [09:57:03] (03CR) 10Muehlenhoff: [C:03+2] docker::baseimages: Stop building new buster base images [puppet] - 10https://gerrit.wikimedia.org/r/1168675 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff) [09:59:24] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [09:59:35] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1000) [10:00:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:17] (03PS2) 10Cathal Mooney: admin: add sowmya.guru to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/1168404 (https://phabricator.wikimedia.org/T398686) [10:03:09] (03PS14) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 
(https://phabricator.wikimedia.org/T397003) [10:03:36] (03CR) 10CI reject: [V:04-1] prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:05:35] (03PS15) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [10:06:02] (03CR) 10CI reject: [V:04-1] prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:06:51] 06SRE, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#10999632 (10MoritzMuehlenhoff) [10:07:08] (03CR) 10Kosta Harlan: WIP: Prep hCaptcha config (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [10:07:53] (03CR) 10Zoe: [C:03+2] Redo "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [10:08:33] (03CR) 10Federico Ceratto: [C:03+2] "Ok, removed changes to that file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169034 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [10:08:34] (03PS2) 10Jcrespo: mariadb: Upgrade db2200 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168166 (https://phabricator.wikimedia.org/T399298) [10:08:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:08:50] (03Merged) 10jenkins-bot: Redo "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [10:09:51] (03PS1) 10Sergio Gimeno: [Growth]: make limiting add a link available to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169064 (https://phabricator.wikimedia.org/T396382) [10:10:25] RESOLVED: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:53] (03CR) 10Michael Große: [C:03+1] [Growth]: make limiting add a link available to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169064 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [10:11:23] 07sre-alert-triage, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Alert in need of triage: SmartNotHealthy (instance dse-k8s-worker1009:9100) - https://phabricator.wikimedia.org/T399160#10999641 (10BTullis) a:03BTullis [10:11:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:12:17] PROBLEM - Uncommitted 
dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:13:00] (03PS16) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) [10:13:03] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:13:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169064 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [10:14:42] 07sre-alert-triage, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Exclude rbd devices from /usr/local/sbin/smart-data-dump output - https://phabricator.wikimedia.org/T399160#10999658 (10BTullis) [10:14:53] 07sre-alert-triage, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Exclude rbd devices from /usr/local/sbin/smart-data-dump output - https://phabricator.wikimedia.org/T399160#10999660 (10BTullis) p:05Triage→03Medium [10:17:22] (03CR) 10Cathal Mooney: [C:03+2] admin: add sowmya.guru to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/1168404 (https://phabricator.wikimedia.org/T398686) (owner: 10Cathal Mooney) [10:17:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Commit db1259 again', diff saved to https://phabricator.wikimedia.org/P78966 and previous config saved to /var/cache/conftool/dbconfig/20250714-101735-fceratto.json [10:17:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78967 
and previous config saved to /var/cache/conftool/dbconfig/20250714-101741-root.json [10:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 50%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78968 and previous config saved to /var/cache/conftool/dbconfig/20250714-101745-root.json [10:17:46] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts sretest2007.codfw.wmnet [10:17:50] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [10:18:03] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:18:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:19:34] (03CR) 10Tiziano Fogli: prom/metamonitor: add dead man switch and public endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:19:57] (03PS1) 10Federico Ceratto: instances.yaml: add db1259 [puppet] - 10https://gerrit.wikimedia.org/r/1169071 [10:21:24] jmm@cumin1003 decommission (PID 1667645) is awaiting input [10:22:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10999674 (10cmooney) >>! In T398686#10999608, @MoritzMuehlenhoff wrote: > I've edited https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests... 
[10:22:17] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:23:10] (03CR) 10Marostegui: [C:03+1] instances.yaml: add db1259 [puppet] - 10https://gerrit.wikimedia.org/r/1169071 (owner: 10Federico Ceratto) [10:25:24] (03PS1) 10Btullis: SMART: exclude the network block storage types from data collection [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) [10:28:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P78969 and previous config saved to /var/cache/conftool/dbconfig/20250714-102824-marostegui.json [10:28:29] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:28:53] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: add db1259 [puppet] - 10https://gerrit.wikimedia.org/r/1169071 (owner: 10Federico Ceratto) [10:31:57] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [10:32:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78970 and previous config saved to /var/cache/conftool/dbconfig/20250714-103247-root.json [10:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 60%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78971 and previous config saved to /var/cache/conftool/dbconfig/20250714-103251-root.json [10:32:55] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [10:33:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10999715 (10Marostegui) [10:35:58] (03CR) 10Reedy: WIP: Prep hCaptcha config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 
(https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [10:36:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1207 to s5 depooled - T399430', diff saved to https://phabricator.wikimedia.org/P78972 and previous config saved to /var/cache/conftool/dbconfig/20250714-103600-marostegui.json [10:36:09] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [10:36:54] (03PS1) 10Marostegui: db1207: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169075 (https://phabricator.wikimedia.org/T399430) [10:37:26] jmm@cumin1003 decommission (PID 1667645) is awaiting input [10:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 1%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78973 and previous config saved to /var/cache/conftool/dbconfig/20250714-103827-root.json [10:38:52] (03CR) 10Marostegui: [C:03+2] db1207: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169075 (https://phabricator.wikimedia.org/T399430) (owner: 10Marostegui) [10:40:22] (03PS1) 10Joely Rooke WMDE: Activate feature to resolve changelist wikibase link labels in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) [10:40:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [10:41:59] (03CR) 10Effie Mouzeli: [C:03+2] trafficserver: remove mwdebugX XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1164207 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [10:42:22] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2007.codfw.wmnet 
decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:42:32] !log installing glibc security updates on bullseye [10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:42:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:42:45] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest2007.codfw.wmnet [10:42:57] 06SRE, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#10999763 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `sretest2007.codfw.wmnet` - sretest2007.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Aler... 
[10:43:06] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#10999765 (10Dreamy_Jazz) [10:43:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P78974 and previous config saved to /var/cache/conftool/dbconfig/20250714-104332-marostegui.json [10:43:49] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#10999767 (10Dreamy_Jazz) Formatted the request in-line with https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ [10:43:56] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts sretest2008.codfw.wmnet [10:45:15] jouncebot: nowandnext [10:45:15] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1000) [10:45:15] In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1300) [10:47:30] jmm@cumin1003 decommission (PID 1670644) is awaiting input [10:47:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78975 and previous config saved to /var/cache/conftool/dbconfig/20250714-104752-root.json [10:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 75%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78976 and previous config saved to /var/cache/conftool/dbconfig/20250714-104756-root.json [10:48:01] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [10:48:39] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1259.eqiad.wmnet [10:48:41] 06SRE, 
10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#10999791 (10SCherukuwada) Manager approves. [10:48:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1233 - Depool db1233.eqiad.wmnet to then clone it to db1259.eqiad.wmnet - fceratto@cumin1002 [10:48:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10999792 (10ops-monitoring-bot) Started cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002 [10:49:11] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1233 - Depool db1233.eqiad.wmnet to then clone it to db1259.eqiad.wmnet - fceratto@cumin1002 [10:49:11] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1233.eqiad.wmnet onto db1259.eqiad.wmnet [10:49:18] (03PS1) 10Jelto: gitlab: install correct gitlab-ce package on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) [10:49:23] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10999793 (10ops-monitoring-bot) Completed depool of db1233 - Depool db1233.eqiad.wmnet to then clone it to db1259.eqiad.wmnet - fceratto@cumin1002 - fceratto@cumin1002 [10:49:40] (03PS1) 10Marostegui: db2244: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169079 (https://phabricator.wikimedia.org/T399298) [10:50:38] (03CR) 10Marostegui: [C:03+2] db2244: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169079 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [10:51:15] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [10:51:15] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2244.codfw.wmnet with reason: Maintenance [10:51:15] (03CR) 10Seanleong-wmde: [C:03+1] Activate feature to resolve 
changelist wikibase link labels in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [10:51:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2244 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78978 and previous config saved to /var/cache/conftool/dbconfig/20250714-105118-marostegui.json [10:54:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db1259 T393296', diff saved to https://phabricator.wikimedia.org/P78979 and previous config saved to /var/cache/conftool/dbconfig/20250714-105416-fceratto.json [10:54:21] T393296: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296 [10:54:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 5%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78980 and previous config saved to /var/cache/conftool/dbconfig/20250714-105427-root.json [10:54:30] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [10:54:31] (03CR) 10Samwilson: [C:03+1] IS: Set wgTemplateDataEnableCategoryBrowser default enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169051 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [10:55:57] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1259.eqiad.wmnet [10:56:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10999816 (10ops-monitoring-bot) Started cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002 [10:57:10] (03PS1) 10Btullis: Revert "Create the /usr/share/binfmts directory to fix JRE error" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169080 [10:57:11] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
sretest2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:57:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [10:57:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:16] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest2008.codfw.wmnet [10:57:28] 06SRE, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#10999833 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `sretest2008.codfw.wmnet` - sretest2008.codfw.wmnet (**FAIL**) - //Missing DNSName in Nebox fo... [10:57:30] (03PS2) 10Btullis: Revert "Create the /usr/share/binfmts directory to fix JRE error" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169080 (https://phabricator.wikimedia.org/T358866) [10:58:30] (03PS1) 10Muehlenhoff: Remove sretest2007/sretest2008 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1169081 (https://phabricator.wikimedia.org/T399447) [10:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2244 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78981 and previous config saved to /var/cache/conftool/dbconfig/20250714-105833-root.json [10:58:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P78982 and previous config saved to /var/cache/conftool/dbconfig/20250714-105839-marostegui.json [10:58:48] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Rebuild Spark images with Bookworm / bullseye-backports deprecation - 
https://phabricator.wikimedia.org/T390139#10999839 (10BTullis) a:03BTullis [10:59:11] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#10999841 (10MoritzMuehlenhoff) [10:59:22] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#10999846 (10MoritzMuehlenhoff) [11:00:11] fceratto@cumin1002 clone (PID 382241) is awaiting input [11:00:56] (03CR) 10Muehlenhoff: "One nit, otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [11:03:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 90%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78983 and previous config saved to /var/cache/conftool/dbconfig/20250714-110302-root.json [11:03:06] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [11:06:43] (03CR) 10Vgutierrez: [C:03+1] "fix the commit message and LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [11:09:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78984 and previous config saved to /var/cache/conftool/dbconfig/20250714-110932-root.json [11:09:36] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [11:13:35] (03PS1) 10Btullis: Update spark based images to use golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169084 (https://phabricator.wikimedia.org/T390139) [11:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2244 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78985 and previous config saved to 
/var/cache/conftool/dbconfig/20250714-111339-root.json [11:13:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P78986 and previous config saved to /var/cache/conftool/dbconfig/20250714-111346-marostegui.json [11:13:51] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:14:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [11:14:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P78987 and previous config saved to /var/cache/conftool/dbconfig/20250714-111410-marostegui.json [11:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:16:03] (03CR) 10Btullis: [V:03+2 C:03+2] Update spark based images to use golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169084 (https://phabricator.wikimedia.org/T390139) (owner: 10Btullis) [11:16:17] (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Create the /usr/share/binfmts directory to fix JRE error" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169080 (https://phabricator.wikimedia.org/T358866) (owner: 10Btullis) [11:17:26] (03CR) 10Kosta Harlan: [C:03+1] Configure Special:CreateAccount instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [11:18:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1047 (re)pooling @ 100%: Pooling for the first time in es6 T395771', diff saved to https://phabricator.wikimedia.org/P78988 and previous config saved to /var/cache/conftool/dbconfig/20250714-111808-root.json [11:18:19] T395771: 
Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [11:21:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169051 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [11:23:52] (03Merged) 10jenkins-bot: IS: Set wgTemplateDataEnableCategoryBrowser default enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169051 (https://phabricator.wikimedia.org/T391064) (owner: 10Samtar) [11:24:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78989 and previous config saved to /var/cache/conftool/dbconfig/20250714-112438-root.json [11:24:42] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [11:26:40] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1169051|IS: Set wgTemplateDataEnableCategoryBrowser default enabled (T391064)]] [11:26:50] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [11:28:16] (03CR) 10Muehlenhoff: [C:03+2] Remove sretest2007/sretest2008 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1169081 (https://phabricator.wikimedia.org/T399447) (owner: 10Muehlenhoff) [11:28:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2244 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78990 and previous config saved to /var/cache/conftool/dbconfig/20250714-112844-root.json [11:31:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:31:54] (03PS1) 10Effie Mouzeli: DNM: Remove testserver from conftool and scap [puppet] - 
10https://gerrit.wikimedia.org/r/1169091 (https://phabricator.wikimedia.org/T397498) [11:32:50] (03PS1) 10Marostegui: db2162: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169092 (https://phabricator.wikimedia.org/T399298) [11:33:08] (03PS2) 10Jelto: gitlab: install correct gitlab-ce package on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) [11:33:27] (03CR) 10Marostegui: [C:03+2] db2162: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169092 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [11:33:28] (03CR) 10Jelto: gitlab: install correct gitlab-ce package on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [11:34:15] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2162.codfw.wmnet with reason: Maintenance [11:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2162 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78991 and previous config saved to /var/cache/conftool/dbconfig/20250714-113418-marostegui.json [11:34:46] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [11:37:10] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6254/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [11:37:27] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: install correct gitlab-ce package on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1169078 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [11:39:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 30%: Repooling in s5 for the 
first time T399430', diff saved to https://phabricator.wikimedia.org/P78992 and previous config saved to /var/cache/conftool/dbconfig/20250714-113943-root.json [11:39:48] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [11:42:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78993 and previous config saved to /var/cache/conftool/dbconfig/20250714-114207-root.json [11:43:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P78994 and previous config saved to /var/cache/conftool/dbconfig/20250714-114329-marostegui.json [11:43:34] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:43:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2244 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78995 and previous config saved to /var/cache/conftool/dbconfig/20250714-114350-root.json [11:46:24] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2162 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1169093 (https://phabricator.wikimedia.org/T399456) [11:48:03] !log samtar@deploy1003 samtar: Backport for [[gerrit:1169051|IS: Set wgTemplateDataEnableCategoryBrowser default enabled (T391064)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:48:07] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [11:48:51] !log samtar@deploy1003 samtar: Continuing with sync [11:53:30] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:54:27] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000004 (10Jclark-ctr) a:03Jclark-ctr [11:54:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 50%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78996 and previous config saved to /var/cache/conftool/dbconfig/20250714-115449-root.json [11:54:55] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [11:57:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78997 and previous config saved to /var/cache/conftool/dbconfig/20250714-115713-root.json [11:58:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:58:29] (03PS3) 10KartikMistry: machinetranslationt: Use s3 model storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) [11:58:29] (03CR) 10KartikMistry: machinetranslationt: Use s3 model storage for production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [11:58:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff 
saved to https://phabricator.wikimedia.org/P78998 and previous config saved to /var/cache/conftool/dbconfig/20250714-115836-marostegui.json [11:59:22] (03CR) 10Filippo Giunchedi: SMART: exclude the network block storage types from data collection (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) (owner: 10Btullis) [12:00:07] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000024 (10Jclark-ctr) @BTullis We just received another an-worker RAID ticket. I’m opening a ticket with Dell to get a replacement drive. It should arrive by Wednesday. I’d like to replace it on Thursda... [12:02:11] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169051|IS: Set wgTemplateDataEnableCategoryBrowser default enabled (T391064)]] (duration: 35m 30s) [12:02:15] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.445s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:03:36] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1166798 (https://phabricator.wikimedia.org/T398668) (owner: 10Giuseppe Lavagetto) [12:04:36] (03PS4) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) [12:04:42] (03CR) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [12:06:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000041 (10Jclark-ctr) Service Request 212783591 [12:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.781s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:09:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 60%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P78999 and previous config saved to /var/cache/conftool/dbconfig/20250714-120955-root.json [12:10:01] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [12:11:39] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nicely done!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:12:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79000 and previous config saved to /var/cache/conftool/dbconfig/20250714-121218-root.json [12:12:52] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1246.mgmt:22 - https://phabricator.wikimedia.org/T399358#11000051 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Server has power removed T393296 set to failed in netbox [12:13:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P79001 and previous config saved to /var/cache/conftool/dbconfig/20250714-121344-marostegui.json [12:13:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [12:14:04] (03CR) 10KartikMistry: [C:03+2] machinetranslationt: Use s3 model storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [12:14:29] Deploying MinT/machinetranslation ^ [12:15:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission sretest1001 - https://phabricator.wikimedia.org/T399435#11000067 (10Jclark-ctr) [12:15:45] (03Merged) 10jenkins-bot: machinetranslationt: Use s3 model storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1167854 (https://phabricator.wikimedia.org/T335491) (owner: 10KartikMistry) [12:15:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission sretest1001 - https://phabricator.wikimedia.org/T399435#11000068 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:17:19] (03CR) 10Fabfur: [C:03+2] cache::haproxy: add x_analytics log variable to 
http frontend too [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [12:17:45] (03PS1) 10David Caro: gitlab: allow runners to contact the proxy api [puppet] - 10https://gerrit.wikimedia.org/r/1169098 [12:19:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#11000077 (10Jclark-ctr) 05Open→03Resolved Dell will not perform any repairs since the server is currently back online. There is an ongoin... [12:19:27] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [12:21:03] (03CR) 10Jelto: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1169098 (owner: 10David Caro) [12:22:04] (03PS2) 10David Caro: gitlab: allow runners to contact the proxy api [puppet] - 10https://gerrit.wikimedia.org/r/1169098 [12:22:46] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000082 (10Jclark-ctr) [12:23:03] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11000085 (10Jclark-ctr) [12:23:54] (03PS3) 10David Caro: gitlab: allow runners to contact the proxy api [puppet] - 10https://gerrit.wikimedia.org/r/1169098 [12:24:17] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [12:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P79002 and previous config saved to /var/cache/conftool/dbconfig/20250714-122500-root.json [12:25:05] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [12:25:41] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1169098 (owner: 10David Caro) [12:27:04] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [12:27:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79003 and previous config saved to /var/cache/conftool/dbconfig/20250714-122724-root.json [12:28:03] (03PS1) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169104 (https://phabricator.wikimedia.org/T371881) [12:28:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:28:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79004 and previous config saved to /var/cache/conftool/dbconfig/20250714-122852-marostegui.json [12:28:56] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:29:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [12:29:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P79005 and previous config saved to /var/cache/conftool/dbconfig/20250714-122914-marostegui.json [12:31:20] (03CR) 10David Caro: [C:03+2] gitlab: allow runners to contact the proxy api [puppet] - 10https://gerrit.wikimedia.org/r/1169098 (owner: 10David Caro) [12:31:53] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning [12:32:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet 
again - https://phabricator.wikimedia.org/T393296#11000101 (10ops-monitoring-bot) Start pool of db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning - fceratto@cumin1002 [12:33:12] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [12:34:34] !log machinetranslationt: Use s3 model storage for production (T335491) [12:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:38] T335491: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 [12:35:27] (03PS2) 10Btullis: SMART: exclude the rados block device types from data collection [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) [12:38:37] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [12:38:59] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [12:40:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 90%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P79007 and previous config saved to /var/cache/conftool/dbconfig/20250714-124006-root.json [12:40:10] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [12:40:19] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: MariaDB package upgrade [12:43:08] (03CR) 10Btullis: SMART: exclude the rados block device types from data collection (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) (owner: 10Btullis) [12:43:41] (03PS3) 10Btullis: SMART: exclude the rados block device types from data collection [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) [12:43:49] (03CR) 
10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) (owner: 10Btullis) [12:46:11] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) (owner: 10Btullis) [12:54:42] !log btullis@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [12:55:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P79010 and previous config saved to /var/cache/conftool/dbconfig/20250714-125504-marostegui.json [12:55:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:55:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Repooling in s5 for the first time T399430', diff saved to https://phabricator.wikimedia.org/P79011 and previous config saved to /var/cache/conftool/dbconfig/20250714-125512-root.json [12:55:16] T399430: Move db1207 to s5 - https://phabricator.wikimedia.org/T399430 [12:56:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:58:33] (03PS1) 10Btullis: Apt: Update the thirdparty/bigtop33 report to thirdparty/bigtop34 [puppet] - 10https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) [12:59:22] (03CR) 10Btullis: [C:03+2] SMART: exclude the rados block device types from data collection [puppet] - 10https://gerrit.wikimedia.org/r/1169074 (https://phabricator.wikimedia.org/T399160) (owner: 10Btullis) [12:59:23] jouncebot: nowandnext [12:59:23] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [12:59:24] In 0 
hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1300) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1300). [13:00:05] danisztls and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:10] (03CR) 10Elukey: [C:03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168839 (https://phabricator.wikimedia.org/T360362) (owner: 10Jelto) [13:00:13] hello [13:02:01] o/ [13:02:44] I can deploy! [13:02:51] thnx [13:03:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168653 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [13:03:23] Lucas_WMDE: o/ if it is not an issue I just added a last minute change [13:03:39] mediawiki-config change for eventgate [13:03:46] ok [13:03:53] thanksss [13:04:10] (03Merged) 10jenkins-bot: Deploy Readers Use Cases Survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168653 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [13:04:18] Lucas_WMDE: since my patch only increases the coverage ratio (from 0 to 0.005) I can't test it further than I already did [13:04:23] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1168653|Deploy Readers Use Cases Survey on enwiki (T398870)]] [13:04:27] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [13:04:30] 07sre-alert-triage, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Exclude rbd devices from 
/usr/local/sbin/smart-data-dump output - https://phabricator.wikimedia.org/T399160#11000191 (10BTullis) 05Open→03Resolved [13:04:37] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:04] danisztls: ack [13:06:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1166233 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [13:07:17] l10n-update step seems to take longer than usual (started 13:04:23 and not finished yet) [13:07:23] maybe it’s the first deployment of the new train? [13:08:18] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Backport for [[gerrit:1168653|Deploy Readers Use Cases Survey on enwiki (T398870)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:08:35] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:21] o_O [13:09:27] console output is still stuck at “started l10n-update”… [13:09:32] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, dani: Continuing with sync [13:09:36] ok, apparently I had to reload the page [13:10:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79013 and previous config saved to /var/cache/conftool/dbconfig/20250714-131011-marostegui.json [13:12:15] (03CR) 10Daimona Eaytoy: "Help wanted for the logo thingy :) Also, once this is merged, I'll make a task to gather input on the unclear settings, tagging the releva" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [13:14:57] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1168653|Deploy Readers Use Cases Survey on enwiki (T398870)]] (duration: 10m 34s) [13:15:01] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [13:15:43] Lucas_WMDE: thanks! 
[13:15:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169064 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [13:16:29] (03PS1) 10Tiziano Fogli: mailman: avoid pint linting alerts related to backup instance [alerts] - 10https://gerrit.wikimedia.org/r/1169107 [13:16:38] (03Merged) 10jenkins-bot: [Growth]: make limiting add a link available to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169064 (https://phabricator.wikimedia.org/T396382) (owner: 10Sergio Gimeno) [13:16:50] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169064|[Growth]: make limiting add a link available to all wikis (T396382)]] [13:16:52] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] EventStreamConfig: add the maps.tiles_change_bookworm stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [13:16:54] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:16:57] (03PS1) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 [13:17:00] danisztls: np :) [13:17:21] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning [13:17:22] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli) [13:17:23] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1259.eqiad.wmnet [13:17:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11000224 (10ops-monitoring-bot) Completed pool of db1233 
gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning - fceratto@cumin1002 [13:17:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11000225 (10ops-monitoring-bot) Finished cloning db1233.eqiad.wmnet to db1259.eqiad.wmnet - fceratto@cumin1002 [13:17:38] (03CR) 10Jelto: [C:03+2] miscweb: remove design-style-guide [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168839 (https://phabricator.wikimedia.org/T360362) (owner: 10Jelto) [13:18:45] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, sgimeno: Backport for [[gerrit:1169064|[Growth]: make limiting add a link available to all wikis (T396382)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:18:53] testing [13:19:52] looks good on my end @Lucas_WMDE [13:19:52] (03Merged) 10jenkins-bot: miscweb: remove design-style-guide [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168839 (https://phabricator.wikimedia.org/T360362) (owner: 10Jelto) [13:19:57] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, sgimeno: Continuing with sync [13:20:00] ok, thanks for testing! 
[13:20:48] (03CR) 10Tiziano Fogli: [C:03+1] icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff) [13:21:08] (03PS4) 10Ssingh: P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) [13:25:19] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169064|[Growth]: make limiting add a link available to all wikis (T396382)]] (duration: 08m 28s) [13:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79015 and previous config saved to /var/cache/conftool/dbconfig/20250714-132518-marostegui.json [13:25:23] T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382 [13:26:07] Thank you @Lucas_WMDE [13:26:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [13:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:27:17] (03Merged) 10jenkins-bot: EventStreamConfig: add the maps.tiles_change_bookworm stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [13:27:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) (owner: 10Btullis) [13:27:31] Lucas_WMDE: I don't have a way to test it sadly, so you can proceed once it passes canaries [13:27:31] !log 
lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167438|EventStreamConfig: add the maps.tiles_change_bookworm stream (T381565)]] [13:27:37] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [13:27:44] (03PS1) 10Federico Ceratto: db1259.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169110 (https://phabricator.wikimedia.org/T393296) [13:27:44] (03CR) 10Federico Ceratto: "Enable notif - Icinga is green" [puppet] - 10https://gerrit.wikimedia.org/r/1169110 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [13:27:50] at the end I'll have to roll restart eventgate on k8s, and that will enable the new stream [13:27:51] ok [13:27:53] super [13:28:08] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11000262 (10cmooney) This is still down today. ` 7/14/2025 7:36:04 AM The card has not been fully replaced yet. Troubleshooting is ongoing with the vendor engi... [13:28:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6255/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [13:28:23] (03PS2) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 [13:28:49] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli) [13:29:22] !log lucaswerkmeister-wmde@deploy1003 elukey, lucaswerkmeister-wmde: Backport for [[gerrit:1167438|EventStreamConfig: add the maps.tiles_change_bookworm stream (T381565)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:29:30] (03PS3) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 [13:30:12] !log lucaswerkmeister-wmde@deploy1003 elukey, lucaswerkmeister-wmde: Continuing with sync [13:30:39] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache-text: remove static rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1166798 (https://phabricator.wikimedia.org/T398668) (owner: 10Giuseppe Lavagetto) [13:30:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000305 (10BTullis) >>! In T399355#11000023, @Jclark-ctr wrote: > @BTullis We just received another an-worker RAID ticket. I’m opening a ticket with Dell to get a replacement drive... [13:31:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11000311 (10BTullis) [13:31:38] (03PS4) 10Effie Mouzeli: profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 [13:32:40] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli) [13:33:57] is there time to fit another patch into the deploy window? 
not sure how far you all are [13:34:44] (03PS1) 10Ebernhardson: typeahead: Add hook to augment api parameters [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169115 (https://phabricator.wikimedia.org/T397732) [13:35:08] (03PS1) 10Ebernhardson: search: Augment typeahead url with test parameters [extensions/WikimediaEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169117 (https://phabricator.wikimedia.org/T397732) [13:35:28] (03PS8) 10Krinkle: varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) [13:35:30] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167438|EventStreamConfig: add the maps.tiles_change_bookworm stream (T381565)]] (duration: 07m 59s) [13:35:34] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [13:35:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169117 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:36:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169115 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:36:52] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1167266'" [13:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:04] I guess we can fit another one in… [13:37:21] i can always do it later, but it looked it might fit [13:37:50] Lucas_WMDE: thanks a lot! 
[13:38:22] (03PS1) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) [13:38:37] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [13:38:52] (03CR) 10CI reject: [V:04-1] k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [13:39:02] (03CR) 10Lucas Werkmeister (WMDE): "(optional comment, as this was already merged on the master branch)" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169117 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:39:10] (03CR) 10Ssingh: [C:03+2] varnish: Swap hardcoded upload.wm.o cond for upload_domain in path normalize [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [13:39:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169115 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:39:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169117 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:40:00] Lucas_WMDE: i don't quite follow your comment, i'm guessing that its the "javascript hooks are so different than php hooks" and i didn't realize? 
[13:40:23] Lucas_WMDE: the expectation is i added a listener, and every time the hook fires it gives the function to the listener and the listener invokes it 0-n times [13:40:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P79016 and previous config saved to /var/cache/conftool/dbconfig/20250714-134026-marostegui.json [13:40:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2197.codfw.wmnet with reason: Maintenance [13:40:30] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:42:03] (03Abandoned) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169104 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [13:42:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[13:42:38] (03PS2) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) [13:44:14] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [13:44:34] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [13:44:50] !log roll restart eventgate-main pods to pick up a new stream - T381565 [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:53] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [13:47:08] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [13:47:29] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:48:17] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11000368 (10ssingh) [13:50:23] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [13:50:27] !log sudo cumin -b11 "A:cp" "run-puppet-agent --enable 'merging CR 1167266'" [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1259 - Pooling in [13:50:47] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1259 - Pooling in [13:51:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11000381 (10ops-monitoring-bot) Completed depool of db1259 - Pooling in - fceratto@cumin1002 [13:54:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1259 gradually with 4 steps - Pooling in [13:54:37] 06SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11000387 (10Ottomata) Trying to remember what the process is for volunteer approval. I found https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_a_volunteer_to_an_ac... [13:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:55:13] sorry, I had an annoying salesperson at my door [13:55:41] ebernhardson: yes, and that’ll mostly work, but MediaWiki will also “cache” the call to the hook handler [13:55:57] (03Merged) 10jenkins-bot: typeahead: Add hook to augment api parameters [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169115 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:55:59] (03Merged) 10jenkins-bot: search: Augment typeahead url with test parameters [extensions/WikimediaEvents] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169117 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [13:56:15] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169115|typeahead: Add hook to augment api parameters (T397732)]], [[gerrit:1169117|search: Augment typeahead url with test parameters (T397732)]] [13:56:21] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [13:56:24] if code A fires the hook, and then some time passes, and then code B gets loaded and adds a handler for the hook, the handler will still be called [13:56:27] with whichever arguments it was last fired [13:56:46] so JS hooks aren’t necessarily synchronous, unlike PHP hooks [13:56:47] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6, 13Patch-For-Review: Enable ipv6 on ganeti2019-ganeti2024 - 
https://phabricator.wikimedia.org/T379890#11000393 (10MoritzMuehlenhoff) 05Open→03Resolved ganeti2019-ganeti2024 have been decommissioned as part of the last server refresh in codfw. [13:57:09] and IMHO it’s probably more confusing than helpful that they’re both called “hooks”, actually :S [13:57:11] Lucas_WMDE: ahh, that is an interesting caveat [13:57:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11000396 (10Addshore) [13:58:08] !log lucaswerkmeister-wmde@deploy1003 ebernhardson, lucaswerkmeister-wmde: Backport for [[gerrit:1169115|typeahead: Add hook to augment api parameters (T397732)]], [[gerrit:1169117|search: Augment typeahead url with test parameters (T397732)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:58:33] checking [13:58:59] JS hooks are used more like notifications [13:59:11] Lucas_WMDE: yup, works same in prod as in testing, appends expected params [13:59:24] should be good to continue deploy [13:59:28] !log lucaswerkmeister-wmde@deploy1003 ebernhardson, lucaswerkmeister-wmde: Continuing with sync [13:59:30] ok [14:03:15] jouncebot: now [14:03:15] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [14:03:25] (UTC afternoon backport+config window is overrunning a bit) [14:05:42] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169115|typeahead: Add hook to augment api parameters (T397732)]], [[gerrit:1169117|search: Augment typeahead url with test parameters (T397732)]] (duration: 09m 27s) [14:05:50] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [14:06:22] (03CR) 10Marostegui: [C:03+1] db1259.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169110 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [14:06:48] (03CR) 10Btullis: 
[C:03+2] Apt: Update the thirdparty/bigtop33 report to thirdparty/bigtop34 [puppet] - 10https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) (owner: 10Btullis) [14:06:54] !log UTC afternoon backport+config window done [14:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:45] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [14:08:25] !log btullis@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [14:08:49] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [14:09:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11000496 (10ssingh) 05In progress→03Resolved a:03ssingh Resolving this as the above has been completed at least from SRE's side. @aranyap: pl... [14:10:10] (03CR) 10Scott French: [C:03+1] "Thanks for updating these!" [alerts] - 10https://gerrit.wikimedia.org/r/1163007 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [14:10:52] (03PS2) 10Andrew Bogott: Neutron: include a python dependency for wmcs-netns-events [puppet] - 10https://gerrit.wikimedia.org/r/1168648 [14:11:57] 06SRE, 06Infrastructure-Foundations: Upgrade the IDP servers to Bookworm - https://phabricator.wikimedia.org/T354405#11000508 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff IDPs are on Bookworm for quite a while already. 
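Note on the 13:55-13:59 exchange about JS hooks: the point made there is that a handler added after a hook has already fired is still called, with whichever arguments the hook was last fired with. A minimal TypeScript sketch of that "memory" behavior is below. It is a stand-in model, not MediaWiki's actual mw.hook code; `MemoryHook`, `seen`, and the string arguments are hypothetical names chosen for illustration.

```typescript
// Illustrative model of a hook registry that remembers its last fire.
// Assumption: this mimics the behavior described in the chat; it is not MediaWiki source.
type Handler = (...args: unknown[]) => void;

class MemoryHook {
  private handlers: Handler[] = [];
  // Arguments of the most recent fire(), or null if the hook has never fired.
  private lastArgs: unknown[] | null = null;

  add(handler: Handler): this {
    this.handlers.push(handler);
    // A late subscriber is replayed the last fired arguments immediately.
    if (this.lastArgs !== null) {
      handler(...this.lastArgs);
    }
    return this;
  }

  fire(...args: unknown[]): this {
    this.lastArgs = args;
    // Iterate over a copy so handlers added during a fire do not run twice.
    for (const handler of [...this.handlers]) {
      handler(...args);
    }
    return this;
  }
}

const seen: string[] = [];
const hook = new MemoryHook();

// "Code A" fires twice before anyone is listening.
hook.fire("first");
hook.fire("second");

// "Code B" loads later; its handler is called right away with "second".
hook.add((value) => {
  seen.push(`B got ${String(value)}`);
});

// Later fires reach the handler as usual.
hook.fire("third");
```

Under this model a listener cannot assume it only runs at the moment the hook fires, which matches the remark that JS hooks behave more like notifications than like PHP hooks.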
[14:13:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [14:18:54] (03CR) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [14:19:53] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#11000590 (10ssingh) 05In progress→03Resolved a:03ssingh Please re-open if there any issues. [14:20:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2217.codfw.wmnet with reason: Maintenance [14:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T399249)', diff saved to https://phabricator.wikimedia.org/P79017 and previous config saved to /var/cache/conftool/dbconfig/20250714-142103-marostegui.json [14:21:10] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:22:16] (03CR) 10Federico Ceratto: [C:03+2] db1259.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169110 (https://phabricator.wikimedia.org/T393296) (owner: 10Federico Ceratto) [14:27:42] (03PS1) 10Ssingh: admin: re-add addshore to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169145 (https://phabricator.wikimedia.org/T399152) [14:29:06] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11000633 (10ssingh) [14:29:17] (03CR) 10Ssingh: [V:03+1] "Since last review: two rebases on prod, no code changes, PCC ran." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1430) [14:31:31] (03CR) 10Ssingh: "I am not setting an expiry for this since access was re-added. Please let me know if that is not the case and an expiry should be set." [puppet] - 10https://gerrit.wikimedia.org/r/1169145 (https://phabricator.wikimedia.org/T399152) (owner: 10Ssingh) [14:31:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm [14:31:55] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1259 gradually with 4 steps - Pooling in [14:31:59] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1259 gradually with 4 steps - Pooling in [14:32:03] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1259 gradually with 4 steps - Pooling in [14:33:32] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1167695'" [14:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:24] (03CR) 10Alexandros Kosiaris: [C:03+1] profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli) [14:35:59] (03CR) 10Addshore: [C:03+1] admin: re-add addshore to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169145 (https://phabricator.wikimedia.org/T399152) (owner: 10Ssingh) [14:40:02] 06SRE, 10LDAP-Access-Requests: Logstash Access for gergesshamon - https://phabricator.wikimedia.org/T399421#11000681 (10ssingh) Hi. For volunteers requesting access, https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_a_volunteer_to_an_access_group needs to be followed. Essentially: 1. You must hav... 
[14:42:06] (03CR) 10Alexandros Kosiaris: [C:04-1] k8s::mediawiki_runner: allow outgoing connections to memcached (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [14:43:01] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cache::haproxy, C:haproxy, hiera: remove OCSP flag and monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1167695 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [14:43:46] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Increase the default batch size of puppet.run() - https://phabricator.wikimedia.org/T397687#11000726 (10elukey) Possible related task: T280622 The concern may be that cookbooks running in parallel at the same time could put some strain the puppetserver... [14:45:58] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Increase the default batch size of puppet.run() - https://phabricator.wikimedia.org/T397687#11000746 (10joanna_borun) p:05Triage→03Medium [14:47:01] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Flaky spicerack icinga unit tests - https://phabricator.wikimedia.org/T397833#11000747 (10joanna_borun) p:05Triage→03Low [14:47:43] 06SRE, 10LDAP-Access-Requests: Logstash Access for gergesshamon - https://phabricator.wikimedia.org/T399421#11000757 (10ssingh) Additionally please note that for logstash-access, you can simply request it via https://idm.wikimedia.org/. See https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access. 
[14:47:55] (03PS1) 10CDobbins: add start of recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 [14:48:07] 06SRE, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): librenms-syslog leaks memory - https://phabricator.wikimedia.org/T397427#11000758 (10cmooney) p:05Triage→03Low [14:49:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169145 (https://phabricator.wikimedia.org/T399152) (owner: 10Ssingh) [14:50:05] (03CR) 10Ssingh: [C:03+2] admin: re-add addshore to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169145 (https://phabricator.wikimedia.org/T399152) (owner: 10Ssingh) [14:50:17] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [14:52:03] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Increase the default batch size of puppet.run() - https://phabricator.wikimedia.org/T397687#11000793 (10Volans) @JMeybohm do you have a specific use case that cannot/is hard to solve simply changing the `batch_size` of the call to `puppet.run()`? https:... 
[14:52:46] (03PS3) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) [14:52:51] (03CR) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [14:53:03] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [14:53:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11000805 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1002:~$ sudo manage_principals.py create addshore --email_address=wikimedia@addshor... [14:54:14] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [14:55:00] !log sudo cumin 'O:alerting_host' 'run-puppet-agent' :T399114 [14:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:03] T399114: Remove OCSP monitoring and related bits - https://phabricator.wikimedia.org/T399114 [14:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399249)', diff saved to https://phabricator.wikimedia.org/P79018 and previous config saved to /var/cache/conftool/dbconfig/20250714-145800-marostegui.json [14:58:05] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:59:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[15:00:14] (03CR) 10Ssingh: [C:03+2] nagios_common: remove check_ssl_cdn_ocsp* [puppet] - 10https://gerrit.wikimedia.org/r/1167698 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [15:01:50] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1259.eqiad.wmnet [15:01:51] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1259.eqiad.wmnet [15:02:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1259 gradually with 4 steps - Pooling in [15:02:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11000856 (10ops-monitoring-bot) Start pool of db1259 gradually with 4 steps - Pooling in - fceratto@cumin1002 [15:02:57] (03CR) 10Giuseppe Lavagetto: [C:03+2] profile::hcaptcha: don't serve / or robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/1168149 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan) [15:04:58] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] [15:05:39] !log sukhe@dns1004 START - running authdns-update [15:06:25] !log sukhe@dns1004 END - running authdns-update [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:13:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79020 and previous config saved to 
/var/cache/conftool/dbconfig/20250714-151308-marostegui.json [15:13:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11000882 (10MoritzMuehlenhoff) >>! In T399152#11000387, @Ottomata wrote: > So, @Ladsgroup's comments satisfies the first. When SRE fulfills this, sh... [15:14:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [15:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:15:25] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1259 gradually with 4 steps - Pooling in [15:15:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1259 gradually with 4 steps - Pooling in [15:15:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11000889 (10ops-monitoring-bot) Start pool of db1259 gradually with 4 steps - Pooling in - fceratto@cumin1002 [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:46] (03CR) 10Eevans: [C:03+2] adjust sessionstore disk utilization for JBOD [alerts] - 10https://gerrit.wikimedia.org/r/1163007 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:17:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bookworm [15:17:22] (03PS1) 10Elukey: python-webapp: add external-services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169162 
(https://phabricator.wikimedia.org/T398640) [15:21:24] (03CR) 10Elukey: [C:03+2] services: configure tegola in codfw to use maps-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:21:52] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=tegola,name=codfw [15:22:46] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=tegola-vector-tiles,name=codfw [15:23:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11000942 (10Jgiannelos) Could that be related to this ticket? https://phabricator.wikimedia.org/T383127 [15:26:27] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11000966 (10elukey) >>! In T381565#11000942, @Jgiannelos wrote: > Could that be related to this ticket? https://phabricator.wikimedia.org/T383127 @Jgiannelos the task was opened in J... [15:27:44] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [15:28:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79022 and previous config saved to /var/cache/conftool/dbconfig/20250714-152815-marostegui.json [15:29:24] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2200 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168166 (https://phabricator.wikimedia.org/T399298) (owner: 10Jcrespo) [15:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1530). 
[15:37:49] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [15:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399249)', diff saved to https://phabricator.wikimedia.org/P79024 and previous config saved to /var/cache/conftool/dbconfig/20250714-154322-marostegui.json [15:43:29] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:43:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2224.codfw.wmnet with reason: Maintenance [15:43:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T399249)', diff saved to https://phabricator.wikimedia.org/P79025 and previous config saved to /var/cache/conftool/dbconfig/20250714-154346-marostegui.json [15:46:10] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1259 gradually with 4 steps - Pooling in [15:46:15] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2001 [15:46:20] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11001077 (10ops-monitoring-bot) Completed pool of db1259 gradually with 4 steps - Pooling in - fceratto@cumin1002 [15:46:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2001 [15:47:44] (03PS1) 10Elukey: profile::maps::osm_master: allow tilerator for kubepods on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1169174 (https://phabricator.wikimedia.org/T381565) [15:49:46] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1169174 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:53:16] (03CR) 10Elukey: [V:03+1 
C:03+2] profile::maps::osm_master: allow tilerator for kubepods on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1169174 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:54:46] (03PS1) 10Zabe: Fix join conditions in categorylinks read new code [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169180 (https://phabricator.wikimedia.org/T399431) [15:56:16] jouncebot: nowandnext [15:56:16] For the next 0 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1530) [15:56:16] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1700) [15:56:16] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [15:56:16] In 1 hour(s) and 3 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1700) [15:56:28] (03CR) 10Zabe: [C:03+2] Fix join conditions in categorylinks read new code [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169180 (https://phabricator.wikimedia.org/T399431) (owner: 10Zabe) [15:58:53] (03CR) 10Dreamy Jazz: "Thanks for the information. I'll bear that in mind in the future." [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: 10Dreamy Jazz) [15:59:29] (03PS2) 10Scott French: php8.1: rebuild to pick up new php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 [15:59:33] 10ops-codfw, 06SRE, 06DC-Ops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#11001176 (10Jhancock.wm) 05Open→03Resolved @cmooney saved a csv of the data (as a just in case) but deleted them. They were never ours so it should be good. Ty for... 
[16:01:22] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2001 [16:01:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2001 [16:02:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#11001190 (10Jhancock.wm) @jhathaway Hi! checking back in to see if any assistance is needed. [16:03:03] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:03:14] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-b7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T399136#11001199 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm acknowledged. 
regarding this: T394847 [16:03:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:04:16] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#11001207 (10Jhancock.wm) [16:04:18] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T396914#11001210 (10Jhancock.wm) →14Duplicate dup:03T395526 [16:06:22] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [16:07:17] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:07:21] (03PS3) 10Scott French: php8.1: rebuild to pick up new php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 [16:11:09] (03Merged) 10jenkins-bot: Fix join conditions in categorylinks read new code [extensions/FlaggedRevs] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169180 (https://phabricator.wikimedia.org/T399431) (owner: 10Zabe) [16:11:32] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169180|Fix join conditions in categorylinks read new code (T399431)]] [16:11:40] T399431: Special:PendingChanges shows wrong and duplicated results if used with category - https://phabricator.wikimedia.org/T399431 [16:13:29] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169180|Fix join conditions in categorylinks read new code (T399431)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:14:13] !log zabe@deploy1003 zabe: Continuing with sync [16:14:35] (03PS1) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) [16:14:52] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [16:15:31] (03CR) 10CI reject: [V:04-1] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [16:15:59] (03CR) 10Scott French: php8.1: rebuild to pick up new php version (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [16:18:48] (03CR) 10Alexandros Kosiaris: [C:03+1] php8.1: rebuild to pick up new php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [16:19:36] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169180|Fix join conditions in categorylinks read new code (T399431)]] (duration: 08m 04s) [16:19:45] T399431: Special:PendingChanges shows wrong and duplicated results if used with category - https://phabricator.wikimedia.org/T399431 [16:22:09] (03CR) 10Ssingh: "Failing because no expiry date is set (trying to get clarity on that) and the invalid UID." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [16:23:19] (03CR) 10BCornwall: [C:03+1] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1169050 (https://phabricator.wikimedia.org/T399446) (owner: 10Gerrit maintenance bot) [16:23:39] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1165186 (owner: 10Ncmonitor) [16:24:49] !log dancy@deploy1003 Installing scap version "4.188.0" for 180 host(s) [16:24:58] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [16:25:06] !log dancy@deploy1003 Installing scap version "4.188.0" for 2 host(s) [16:27:34] !log dancy@deploy1003 Installing scap version "4.188.0" for 180 host(s) [16:27:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11001359 (10RobH) [16:29:52] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#11001364 (10sowmya.guru) Thanks everyone! 
:) [16:30:29] (03PS1) 10Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) [16:32:42] (03PS2) 10Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) [16:32:44] !log dancy@deploy1003 Installing scap version "4.188.0" for 1 host(s) [16:39:37] !log fceratto@cumin1002 dbctl restore of MediaWiki config (dc=all) from a [16:41:58] (03PS1) 10Vgutierrez: hiera: use the alt chain on half upload@magru for measure cert [puppet] - 10https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) [16:42:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [16:47:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [16:48:37] !log dancy@deploy1003 Installing scap version "4.188.1" for 2 host(s) [16:50:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399249)', diff saved to https://phabricator.wikimedia.org/P79028 and previous config saved to /var/cache/conftool/dbconfig/20250714-165009-marostegui.json [16:50:14] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:50:23] !log dancy@deploy1003 Installation of scap version "4.188.1" completed for 2 hosts [16:51:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:52:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.826 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:52:17] RECOVERY - Uncommitted dbctl configuration 
changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:53:03] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:53:43] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11001523 (10Jhancock.wm) @BTullis can you depool this server for me so i can do the upgrade. It will require a reboot (or two) of the server. [16:53:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1700) [17:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T1700) [17:05:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79029 and previous config saved to /var/cache/conftool/dbconfig/20250714-170517-marostegui.json [17:20:14] (03PS1) 10Lucas Werkmeister: tools-static: Update proxy_pass [puppet] - 10https://gerrit.wikimedia.org/r/1169207 (https://phabricator.wikimedia.org/T399483) [17:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79030 and previous config saved to /var/cache/conftool/dbconfig/20250714-172024-marostegui.json [17:20:51] (03CR) 10Ssingh: [C:03+1] "Verified hostnames, alt cert path, host-level hiera override, and magru enabled certificates."
[puppet] - 10https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [17:23:05] (03PS3) 10Ssingh: hiera: service.yaml: use better aliasing for text/upload [puppet] - 10https://gerrit.wikimedia.org/r/1168192 [17:24:29] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6257/" [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [17:24:53] (03CR) 10Lucas Werkmeister: "Disclaimer: I have absolutely no idea if this will work or not." [puppet] - 10https://gerrit.wikimedia.org/r/1169207 (https://phabricator.wikimedia.org/T399483) (owner: 10Lucas Werkmeister) [17:27:16] (03CR) 10Ssingh: [V:03+1] "PCC for lvs1016 has changed order but should otherwise be a NOOP." [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [17:30:58] (03PS1) 10Ebernhardson: Add RKD to WDQS allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1169208 (https://phabricator.wikimedia.org/T398820) [17:32:53] (03PS1) 10Ebernhardson: cirrus: Drop absented periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1169209 [17:32:53] (03PS1) 10Ebernhardson: cirrus: Drop absented periodic_job (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1169210 [17:35:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399249)', diff saved to https://phabricator.wikimedia.org/P79031 and previous config saved to /var/cache/conftool/dbconfig/20250714-173531-marostegui.json [17:35:37] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:35:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2229.codfw.wmnet with reason: Maintenance [17:35:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2229 (T399249)', diff saved to https://phabricator.wikimedia.org/P79032 and previous 
config saved to /var/cache/conftool/dbconfig/20250714-173554-marostegui.json [17:40:23] (03PS2) 10Ebernhardson: Add RKD to WDQS allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1169208 (https://phabricator.wikimedia.org/T398820) [17:41:50] (03PS1) 10Ssingh: admin: use correct uid for missguru [puppet] - 10https://gerrit.wikimedia.org/r/1169213 (https://phabricator.wikimedia.org/T398686) [17:46:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#11001806 (10Addshore) Can confirm my access is back, many thanks all! [17:48:37] (03CR) 10Ssingh: [V:03+1 C:03+2] P:cache::haproxy: remove obsolete do_ocsp [puppet] - 10https://gerrit.wikimedia.org/r/1167686 (https://phabricator.wikimedia.org/T399114) (owner: 10Ssingh) [17:49:20] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1167686'" [17:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:54] !log sudo cumin -b31 "A:cp" "run-puppet-agent --enable 'merging CR 1167686'" [17:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:59:09] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on aqs1012.eqiad.wmnet with reason: Drive replacement [17:59:15] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11001818 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=33227d80-c2fb-47d6-8915-7a50d5148bca) set by eevans@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Drive replacem... 
[18:00:54] (03PS1) 10Alexandros Kosiaris: mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 [18:00:54] (03PS1) 10Alexandros Kosiaris: deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 [18:00:55] (03PS1) 10Alexandros Kosiaris: admin: Empty out kartotherian-admin [puppet] - 10https://gerrit.wikimedia.org/r/1169219 [18:00:55] (03PS1) 10Alexandros Kosiaris: admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 [18:00:56] (03PS1) 10Alexandros Kosiaris: maps: Cleanup DB grants, add tegola, prep tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169221 [18:00:57] (03PS1) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 [18:01:01] (03PS1) 10Alexandros Kosiaris: DNM: tilerator: Remove as much as possible of the last cruft [puppet] - 10https://gerrit.wikimedia.org/r/1169223 [18:01:29] (03CR) 10Dzahn: [C:03+2] logspam.pl: Avoid consolidation of wrapped error message [puppet] - 10https://gerrit.wikimedia.org/r/1167932 (https://phabricator.wikimedia.org/T399239) (owner: 10Ahmon Dancy) [18:03:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169213 (https://phabricator.wikimedia.org/T398686) (owner: 10Ssingh) [18:03:35] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:45] (03CR) 10Ssingh: [C:03+2] admin: use correct uid for missguru [puppet] - 10https://gerrit.wikimedia.org/r/1169213 (https://phabricator.wikimedia.org/T398686) (owner: 10Ssingh) [18:04:05] (03CR) 10CI reject: [V:04-1] mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (owner: 10Alexandros Kosiaris) 
[18:04:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [18:09:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11001832 (10VRiley-WMF) KN09N7919I0709R1S is in slot 0 and being replaced with KJA8N5701I0808167 KN09N7919I0709R1T is in slot 2. Currently, we will test it with a single drive replacement to see if it is sucs... [18:10:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11001833 (10VRiley-WMF) Unit is powering back on. [18:12:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T399249)', diff saved to https://phabricator.wikimedia.org/P79034 and previous config saved to /var/cache/conftool/dbconfig/20250714-181238-marostegui.json [18:12:42] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:19:54] 06SRE, 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#11001861 (10Dzahn) Devil's advocate here: Is it possible we are spending more time on fixing parse issues (that are expected to keep happening because upstreams will always change... [18:20:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11001862 (10cmooney) Just an update on this, we unfortunately did not have a spare optic of the right kind so dc-ops are ordering one with expedited delivery. We... 
[18:25:27] (03PS1) 10DDesouza: Readers Use Cases Survey: Set token param name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169226 (https://phabricator.wikimedia.org/T398870) [18:26:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169226 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza) [18:27:06] (03CR) 10Dzahn: [C:03+2] zuul: Add profile::zuul::haproxy for Cloud VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1166006 (https://phabricator.wikimedia.org/T396936) (owner: 10BryanDavis) [18:27:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P79035 and previous config saved to /var/cache/conftool/dbconfig/20250714-182745-marostegui.json [18:29:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11001879 (10VRiley-WMF) Hey @BCornwall I have swapped the cables, would you be able to test this again? (I was going to try to reimage it, but didn't know what version of bullseye to put on it, I wa... [18:29:16] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11001880 (10Eevans) >>! In T396970#11001833, @VRiley-WMF wrote: > Unit is powering back on. It doesn't seem to have booted. :( [18:30:51] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn) [18:31:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11001886 (10Eevans) {F64405933} [18:34:54] (03PS7) 10Dzahn: scap: stop hardcoding scap user home to fix puppet breakage [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) [18:35:47] (03CR) 10Dzahn: "I am going with "That being said, removing the hardcoded home path and referencing instead the value in scap::user is an improvement on it" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [18:37:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11001895 (10FCeratto-WMF) 05In progress→03Resolved [18:39:18] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11001899 (10ssingh) @SCherukuwada: Hi, thanks for the approval above. Since the email ends in `-ctr`, can you let us know the cont... 
[18:42:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P79036 and previous config saved to /var/cache/conftool/dbconfig/20250714-184253-marostegui.json [18:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T399249)', diff saved to https://phabricator.wikimedia.org/P79037 and previous config saved to /var/cache/conftool/dbconfig/20250714-185800-marostegui.json [18:58:05] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:07:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:09:54] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1169234 [19:14:08] (03PS1) 10Ssingh: admin: data_test.py: bump system_uid_max to 499999 [puppet] - 10https://gerrit.wikimedia.org/r/1169235 (https://phabricator.wikimedia.org/T355663) [19:14:10] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1169234 [19:14:10] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff) [19:14:53] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [19:14:53] (03CR) 10Ssingh: "CI fix in I315b5fcf6ab242e330b3be20c84f5ab0102ade65" [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [19:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:16:43] (03PS2) 10Ssingh: admin: data_test.py: bump system_uid_max to 499999 [puppet] - 10https://gerrit.wikimedia.org/r/1169235 (https://phabricator.wikimedia.org/T355663) [19:22:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:33] (03PS3) 10Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) [19:24:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 8.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:24] (03CR) 10Dreamy Jazz: WIP: Prep hCaptcha config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [19:27:22] (03PS8) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [19:28:20] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11002060 (10Jhancock.wm) @Clement_Goubert hey this is a test server that could be a 1 CPU alternative for your wikikube-worker servers. Could you set the partman for this server how you would prefer it... 
[19:30:48] (PS3) Herron: Pyrra-filesystem: purge unmanaged files from config directory [puppet] - https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) [19:39:52] ops-codfw, DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T399494 (phaultfinder) NEW [19:42:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [19:43:04] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002108 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye [19:57:10] (CR) Dreamy Jazz: WIP: Prep hCaptcha config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy) [19:58:05] !log dancy@deploy1003 Installing scap version "4.188.0" for 1 host(s) [19:58:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2000). Please do the needful. [20:00:05] danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:29] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.886 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:46] (PS9) Dreamy Jazz: Set hCaptcha config [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy) [20:06:05] jouncebot: nowandnext [20:06:05] For the next 0 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2000) [20:06:05] In 0 hour(s) and 53 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2100) [20:06:23] Anyone handling the backport window? [20:06:49] If not, I can backport and then backport a change I want to deploy. [20:07:05] danisztls: Do you need someone to deploy your change? [20:07:45] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002158 (BCornwall) Looks like that worked, it's booting PXE now. Thanks! [20:08:32] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye [20:08:33] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002159 (VRiley-WMF) Awesome! is this okay to close out? [20:08:39] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002160 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL**) - Removed f... [20:08:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm [20:08:57] danisztls: You here for the window? 
[20:09:00] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1169235 (https://phabricator.wikimedia.org/T355663) (owner: Ssingh) [20:09:03] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002162 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm [20:09:47] I'm going to deploy my change first then given that danisztls appears to be AFK [20:10:07] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy) [20:10:47] sorry [20:10:51] late, but here [20:10:54] (Merged) jenkins-bot: Set hCaptcha config [mediawiki-config] - https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: Reedy) [20:11:09] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1148390|Set hCaptcha config (T382148)]] [20:11:13] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [20:11:15] Hello. My one has merged but it shouldn't take too long to complete. [20:11:22] Dreamy_Jazz: I do need someone to deploy, thanks [20:11:27] Will you be able to test the changes? [20:11:42] I can deploy your change after my one [20:13:07] !log dreamyjazz@deploy1003 dreamyjazz, reedy: Backport for [[gerrit:1148390|Set hCaptcha config (T382148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:18:47] !log dreamyjazz@deploy1003 dreamyjazz, reedy: Continuing with sync [20:24:23] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148390|Set hCaptcha config (T382148)]] (duration: 13m 14s) [20:24:27] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [20:24:38] danisztls: I will start on your change now [20:25:28] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage [20:28:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage [20:35:12] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169226 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza) [20:36:38] (Merged) jenkins-bot: Readers Use Cases Survey: Set token param name [mediawiki-config] - https://gerrit.wikimedia.org/r/1169226 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza) [20:36:45] (PS1) Zabe: Set categorylinks to read new on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169246 (https://phabricator.wikimedia.org/T397912) [20:36:49] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1169226|Readers Use Cases Survey: Set token param name (T398870)]] [20:36:53] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [20:37:18] danisztls: Is there anything you can test with this? [20:37:49] Dreamy_Jazz: yes [20:38:03] Dreamy_Jazz: looks good [20:38:11] (PS5) BryanDavis: varnish: Allow customising "contact noc@" error [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah) [20:38:19] The change hasn't been merged yet. [20:38:24] So it's not been applied. 
[20:38:30] ow ok [20:38:36] (PS3) Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) [20:38:44] I'll let you know when it's ready to test [20:38:46] !log dreamyjazz@deploy1003 dani, dreamyjazz: Backport for [[gerrit:1169226|Readers Use Cases Survey: Set token param name (T398870)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:55] Now it should be ready [20:39:34] (Some of the test servers might have been running the new changes just as you tested the feature, but if you could re-try now that would be good) [20:40:34] it looks good [20:40:47] Thanks. proceeding. [20:40:52] !log dreamyjazz@deploy1003 dani, dreamyjazz: Continuing with sync [20:42:47] (CR) Ssingh: [C:+2] admin: data_test.py: bump system_uid_max to 499999 [puppet] - https://gerrit.wikimedia.org/r/1169235 (https://phabricator.wikimedia.org/T355663) (owner: Ssingh) [20:45:10] (PS2) Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) [20:45:40] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [20:45:55] (CR) CI reject: [V:-1] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: Ssingh) [20:46:19] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169226|Readers Use Cases Survey: Set token param name (T398870)]] (duration: 09m 30s) [20:46:23] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [20:47:58] (CR) Ssingh: "16:45:38 Failed validating 'maximum' in 
schema['properties']['users']['additionalProperties']['properties']['uid']:" [puppet] - https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: Ssingh) [20:48:45] brett@cumin2002 reimage (PID 3269879) is awaiting input [20:49:14] (PS1) Dzahn: phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) [20:49:38] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [20:49:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bookworm [20:49:46] (CR) CI reject: [V:-1] phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) (owner: Dzahn) [20:49:47] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002294 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm completed: - lvs1017 (**PASS**) - Removed from Puppet... [20:50:50] (PS1) Ssingh: admin: schema.yaml: bump max for uid [puppet] - https://gerrit.wikimedia.org/r/1169250 (https://phabricator.wikimedia.org/T355663) [20:51:09] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002298 (BCornwall) [20:51:50] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11002300 (BCornwall) In progress→Resolved We're all set. Thank you for all your help, @VRiley-WMF! 
[20:54:59] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1169250 (https://phabricator.wikimedia.org/T355663) (owner: Ssingh) [20:55:24] (CR) Ssingh: [C:+2] admin: schema.yaml: bump max for uid [puppet] - https://gerrit.wikimedia.org/r/1169250 (https://phabricator.wikimedia.org/T355663) (owner: Ssingh) [20:56:10] !log dancy@deploy1003 Installing scap version "4.188.2" for 2 host(s) [20:57:22] (CR) Ssingh: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: Ssingh) [20:57:57] !log dancy@deploy1003 Installation of scap version "4.188.2" completed for 2 hosts [20:58:16] (PS2) Dzahn: phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) [20:58:48] (CR) CI reject: [V:-1] phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) (owner: Dzahn) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2100). 
[21:04:11] (PS3) Dzahn: phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) [21:04:38] (CR) CI reject: [V:-1] phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) (owner: Dzahn) [21:07:38] (PS4) Dzahn: phabricator::migration: fix permissions on scap base path [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) [21:11:37] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1169249/6263/phab1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1169249 (https://phabricator.wikimedia.org/T399480) (owner: Dzahn) [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:52] (PS3) Ryan Kemper: Add RKD to WDQS allowlist [puppet] - https://gerrit.wikimedia.org/r/1169208 (https://phabricator.wikimedia.org/T398820) (owner: Ebernhardson) [21:17:51] (CR) Bking: [C:+1] Add RKD to WDQS allowlist [puppet] - https://gerrit.wikimedia.org/r/1169208 (https://phabricator.wikimedia.org/T398820) (owner: Ebernhardson) [21:18:04] (CR) Ryan Kemper: [C:+2] Add RKD to WDQS allowlist [puppet] - https://gerrit.wikimedia.org/r/1169208 (https://phabricator.wikimedia.org/T398820) (owner: Ebernhardson) [21:28:11] !log ryankemper@cumin1003 START - Cookbook sre.wdqs.restart [21:44:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS 
[21:54:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:59:53] PROBLEM - Host aqs1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:03:35] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:55] (CR) Scott French: "@btullis@wikimedia.org - Can the former `thirdparty/bigtop33` component safely be deleted now as described in [0]?" 
[puppet] - https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) (owner: Btullis) [22:21:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:28:10] !log ryankemper@cumin1003 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:33:31] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:31] RESOLVED: SystemdUnitFailed: wdqs-updater.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2300) [23:03:52] SRE, Traffic: Traffic cache daemon restart scripts need some rework - https://phabricator.wikimedia.org/T346640#11002716 (BCornwall) Do we use these scripts anywhere or are they just for operator convenience? i.e. do we still need them? This also begs the question: Do we want this sort of operation to o... 
[23:07:01] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:14:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:24:36] jouncebot: nowandnext [23:24:36] For the next 0 hour(s) and 35 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250714T2300) [23:24:36] In 2 hour(s) and 35 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0200) [23:24:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:25:35] (CR) Zabe: [C:+2] Set categorylinks to read new on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169246 (https://phabricator.wikimedia.org/T397912) (owner: Zabe) [23:26:21] (Merged) jenkins-bot: Set categorylinks to read new on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169246 (https://phabricator.wikimedia.org/T397912) (owner: Zabe) [23:26:59] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169246|Set categorylinks to read new on more wikis (T397912)]] [23:27:02] T397912: Set 
categorylinks to read new - https://phabricator.wikimedia.org/T397912 [23:28:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169246|Set categorylinks to read new on more wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:29:53] !log zabe@deploy1003 zabe: Continuing with sync [23:30:59] (PS4) Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) [23:35:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169246|Set categorylinks to read new on more wikis (T397912)]] (duration: 08m 26s) [23:35:32] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [23:37:57] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169265 [23:37:57] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169265 (owner: TrainBranchBot) [23:39:46] (PS1) Zabe: Disable categorylinks read new on wikis which depend on missing index [mediawiki-config] - https://gerrit.wikimedia.org/r/1169266 [23:41:14] (PS2) Zabe: Disable categorylinks read new on wikis which depend on missing index [mediawiki-config] - https://gerrit.wikimedia.org/r/1169266 [23:42:55] (CR) Zabe: [C:+2] Disable categorylinks read new on wikis which depend on missing index [mediawiki-config] - https://gerrit.wikimedia.org/r/1169266 (owner: Zabe) [23:43:43] (Merged) jenkins-bot: Disable categorylinks read new on wikis which depend on missing index [mediawiki-config] - https://gerrit.wikimedia.org/r/1169266 (owner: Zabe) [23:44:06] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169266|Disable categorylinks read new on wikis which depend on missing index]] [23:45:59] 
!log zabe@deploy1003 zabe: Backport for [[gerrit:1169266|Disable categorylinks read new on wikis which depend on missing index]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:47:55] !log zabe@deploy1003 zabe: Continuing with sync [23:50:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169265 (owner: TrainBranchBot) [23:53:15] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169266|Disable categorylinks read new on wikis which depend on missing index]] (duration: 09m 09s)