[00:00:32] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1285907|Skin: Correct thumbnail class (T424910)]]
[00:00:35] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910
[00:02:22] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1285907|Skin: Correct thumbnail class (T424910)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:03:25] FIRING: [25x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:49] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[00:07:56] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285907|Skin: Correct thumbnail class (T424910)]] (duration: 07m 24s)
[00:08:00] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910
[00:08:25] FIRING: [50x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:51] (PS1) Eevans: echostore: enable host verification (test) [deployment-charts] - https://gerrit.wikimedia.org/r/1285929 (https://phabricator.wikimedia.org/T425308)
[00:11:11] (CR) Eevans: [C:+2] echostore: enable host verification (test) [deployment-charts] - https://gerrit.wikimedia.org/r/1285929 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[00:11:37] (done)
[00:12:45] (PS1) Dbrant: Add "get_login_creds" permission to Android app.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010)
[00:13:15] (Merged) jenkins-bot: echostore: enable host verification (test) [deployment-charts] - https://gerrit.wikimedia.org/r/1285929 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[00:14:24] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply
[00:16:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:34] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply
[00:29:38] (PS1) Eevans: echostore: disable TLS host verification [deployment-charts] - https://gerrit.wikimedia.org/r/1285932 (https://phabricator.wikimedia.org/T425308)
[00:30:34] (PS2) Dbrant: Add "get_login_creds" permission to Android app.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010)
[00:31:26] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:31:55] (CR) Eevans: [C:+2] echostore: disable TLS host verification [deployment-charts] - https://gerrit.wikimedia.org/r/1285932 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[00:34:00] (Merged) jenkins-bot: echostore: disable TLS host verification [deployment-charts] - https://gerrit.wikimedia.org/r/1285932 (https://phabricator.wikimedia.org/T425308) (owner: Eevans)
[00:35:03] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply
[00:35:09] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply
[00:35:50] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply
[00:36:12] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[00:37:23] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: apply
[00:37:40] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[00:38:14] (PS3) Dbrant: Add "get_login_creds" permission to Android app.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010)
[01:00:48] (PS5) Jasmine: wikikube: add wikikube-ctrl2006 [puppet] - https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596)
[01:01:44] (PS6) Jasmine: wikikube: add wikikube-ctrl2006 [puppet] - https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596)
[01:02:28] (CR) Jasmine: wikikube: add wikikube-ctrl2006 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: Jasmine)
[01:09:13] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.2 [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1285933 (https://phabricator.wikimedia.org/T423911)
[01:09:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.2 [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1285933 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[01:09:26] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285934
[01:09:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285934 (owner: TrainBranchBot)
[01:19:53] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.2 [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1285933 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[01:22:04] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1285934 (owner: TrainBranchBot)
[01:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 11h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[01:44:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service
on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0200)
[02:01:02] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:52] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#11910693 (Papaul)
[02:06:51] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#11910694 (Papaul)
[02:07:40] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 38s)
[02:09:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:20] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:23:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1119:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1119 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:28:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1119:9105 is missing kubernetes logs -
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1119 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:32:20] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:00] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0300)
[03:01:53] (PS1) TrainBranchBot: testwikis to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1285937 (https://phabricator.wikimedia.org/T423911)
[03:01:56] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285937 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[03:02:48] (Merged) jenkins-bot: testwikis to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1285937 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[03:03:13] !log
mwpresync@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.2 refs T423911
[03:03:17] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911
[03:18:18] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:28:18] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:39:49] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.2 refs T423911 (duration: 36m 36s)
[03:39:53] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911
[03:46:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0400)
[04:08:40] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:55] ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T418136#11910824 (phaultfinder)
[05:10:41] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:15:41] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 15h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[05:44:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0600)
[06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0600).
[06:02:33] SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11910860 (Marostegui) @CWilliams-WMF you can go directly for `ops` group instead of `ops-limited` so feel free to amend the patch.
[06:20:46] (CR) Ayounsi: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: Cathal Mooney)
[06:22:29] (CR) JMeybohm: [C:+2] Revert^2 "Bump default rsyslog container version to 8.2504.0-1" [puppet] - https://gerrit.wikimedia.org/r/1285793 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[06:25:23] (CR) JMeybohm: [C:+2] Add ratelimit-media CNAMEs [dns] - https://gerrit.wikimedia.org/r/1285799 (https://phabricator.wikimedia.org/T414439) (owner: JMeybohm)
[06:26:15] !log jayme@dns1004 START - running authdns-update
[06:27:43] !log jayme@dns1004 END - running authdns-update
[06:29:15] ops-drmrs: cr2-drmrs<->asw1-b12-drmrs down - https://phabricator.wikimedia.org/T425921#11910899 (ayounsi) a:RobH
[06:32:58] (CR) JMeybohm: [C:+2] Bump release generation for mercurius to pick up rsyslog upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1285794 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[06:33:20] (PS1) Marostegui: mariadb: Decommission db2142 [puppet] - https://gerrit.wikimedia.org/r/1286167 (https://phabricator.wikimedia.org/T424038)
[06:34:54] (Merged) jenkins-bot: Bump release generation for mercurius to pick up rsyslog upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1285794 (https://phabricator.wikimedia.org/T418200) (owner: JMeybohm)
[06:35:07] (CR) JMeybohm: [C:+1] wikikube: add wikikube-ctrl2006 [puppet] - https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: Jasmine)
[06:36:14] !log jayme@deploy1003 Started scap sync-world: update rsyslog image, T418200
[06:36:17] T418200:
Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[06:37:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2142.codfw.wmnet
[06:37:54] (CR) Marostegui: [C:+2] mariadb: Decommission db2142 [puppet] - https://gerrit.wikimedia.org/r/1286167 (https://phabricator.wikimedia.org/T424038) (owner: Marostegui)
[06:39:56] (PS1) Marostegui: db1231,db2150: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1286168 (https://phabricator.wikimedia.org/T425388)
[06:40:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2150.codfw.wmnet with reason: Reimage to Trixie
[06:40:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2150: Reimage to Trixie
[06:40:38] (CR) Marostegui: [C:+2] db1231,db2150: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1286168 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[06:40:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1231.eqiad.wmnet with reason: Reimage to Trixie
[06:40:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1231: Reimage to Trixie
[06:40:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2150: Reimage to Trixie
[06:41:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1231: Reimage to Trixie
[06:42:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2150.codfw.wmnet with OS trixie
[06:42:31] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[06:42:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1231.eqiad.wmnet with OS trixie
[06:43:41] !log jayme@deploy1003 Finished scap sync-world: update rsyslog image, T418200 (duration: 07m 56s)
[06:43:44] T418200: Migrate Service Ops Docker
images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200
[06:46:36] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2142.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:47:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2142.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:47:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:47:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2142.codfw.wmnet
[06:49:06] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2142.codfw.wmnet - https://phabricator.wikimedia.org/T424038#11910925 (Marostegui) a:Marostegui→None
[06:49:16] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2142.codfw.wmnet - https://phabricator.wikimedia.org/T424038#11910930 (Marostegui) This is ready for #dc-ops
[06:51:00] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:59:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[07:04:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[07:04:46] SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11910955 (KartikMistry) >>! In T425853#11908952, @Dzahn wrote: > Could you maybe drop a file in some home directory on a production server that confirms it? > > Any place, just let us know where. OK. To be...
[07:05:08] (PS12) JMeybohm: tlsproxy::envoy: Support ratelimit configuration [puppet] - https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: Clément Goubert)
[07:08:18] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11910958 (catherine.kelsey.wmde) Thanks @Dzahn for following up! And to answer your questions / comments: - @Lena_WMDE...
[07:08:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: host reimage
[07:09:56] (CR) Marostegui: [C:+1] admin: add Catherine Kelsey of WMDE as ldap_only user [puppet] - https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: Dzahn)
[07:19:00] jouncebot: now
[07:19:00] For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0700)
[07:19:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1284628 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:22:07] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1284628 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:23:11] (Merged) jenkins-bot: cirrus: use a keywork tokenizer for the plain field for autocomplete [mediawiki-config] - https://gerrit.wikimedia.org/r/1284628 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:26:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1231.eqiad.wmnet with OS trixie
[07:28:59] SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11910992 (KartikMistry) If it can be a good place, https://office.wikimedia.org/wiki/User:KMistry_(WMF)#New_key I've put it here as well.
[07:29:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1231: after reimage to trixie
[07:29:51] (CR) Marostegui: [C:+1] "+1ed but still waiting for manager approval."
[puppet] - https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: Dzahn)
[07:30:05] (PS1) Marostegui: Revert "db1231,db2150: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1286170
[07:31:00] (CR) Marostegui: [C:+2] Revert "db1231,db2150: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1286170 (owner: Marostegui)
[07:31:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2150.codfw.wmnet with OS trixie
[07:35:40] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2150: after reimage to trixie
[07:47:51] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: Clément Goubert)
[07:50:01] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911032 (Lena_WMDE) Hi everyone, I can confirm that @catherine.kelsey.wmde is working at WMDE as a data analyst and I appr...
[07:50:08] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11911033 (Lena_WMDE) Hi everyone, I can confirm that @catherine.kelsey.wmde is working at WMDE as a data analyst and I approve this request. Thanks!
[07:54:15] (PS1) DCausse: Revert "cirrus: use a keywork tokenizer for the plain field for autocomplete" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286253
[07:54:52] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286253 (owner: DCausse)
[07:55:22] scap's broken, reverting merged patch
[07:56:01] (Merged) jenkins-bot: Revert "cirrus: use a keywork tokenizer for the plain field for autocomplete" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286253 (owner: DCausse)
[07:56:30] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1286253|Revert "cirrus: use a keywork tokenizer for the plain field for autocomplete"]]
[08:00:04] andre and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0800).
[08:00:23] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1286253|Revert "cirrus: use a keywork tokenizer for the plain field for autocomplete"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:00:42] dcausse: uh, scap's broken in which way? (Wondering if that would also affect the train)
[08:00:49] !log dcausse@deploy1003 dcausse: Rolling back deployment
[08:01:41] andre: there was some dirty state in /src/patches, jnuche just handled them
[08:01:48] ah thanks!
[08:03:12] (PS1) DCausse: Revert^2 "cirrus: use a keywork tokenizer for the plain field for autocomplete" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286277
[08:03:12] (CR) Marostegui: [C:+1] "Manager confirmed" [puppet] - https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: Dzahn)
[08:03:16] (CR) Marostegui: [C:+2] admin: add Catherine Kelsey of WMDE as ldap_only user [puppet] - https://gerrit.wikimedia.org/r/1285890 (https://phabricator.wikimedia.org/T425566) (owner: Dzahn)
[08:03:32] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286253|Revert "cirrus: use a keywork tokenizer for the plain field for autocomplete"]] (duration: 07m 02s)
[08:04:40] ok I should be done, I'll try to ship my config change in another window
[08:04:57] dcausse: want to try again deploying that backport now that the stuck file in /srv/patches/ is gone
[08:05:03] ah, heh, I just wanted to ask :)
[08:05:15] andre: no thanks but I have to run :)
[08:05:21] dcausse, ah, okay :)
[08:05:30] then I will now start promoting group0 wikis to 1.47.0-wmf.2
[08:07:18] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply
[08:07:22] (PS1) TrainBranchBot: group0 to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1286278 (https://phabricator.wikimedia.org/T423911)
[08:07:25] (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286278 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[08:08:11] (CR) Mszwarc: [C:-1] "Until the patch is updated so that we don't apply the restriction to SUL 'crats" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[08:08:28] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply
[08:08:40] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:45] (Merged) jenkins-bot: group0 to 1.47.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1286278 (https://phabricator.wikimedia.org/T423911) (owner: TrainBranchBot)
[08:10:52] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11911070 (Marostegui) In progress→Resolved a:catherine.kelsey.wmde→Marostegui Done ` root@ldap-maint1001:~# ldapsearch -x cn=nda | gre...
[08:11:30] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911075 (Marostegui)
[08:11:46] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911076 (Marostegui)
[08:14:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1231: after reimage to trixie
[08:17:45] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.2 refs T423911
[08:17:48] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911
[08:21:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2150: after reimage to trixie
[08:23:51] (Abandoned) Clare Ming: Update references to Test Kitchen [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) (owner: Clare Ming)
[08:32:32] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org -
Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:35:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1036.eqiad.wmnet with reason: Maintenance
[08:35:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1036 (T419961)', diff saved to https://phabricator.wikimedia.org/P92476 and previous config saved to /var/cache/conftool/dbconfig/20260512-083526-fceratto.json
[08:41:16] (CR) JMeybohm: [C:+2] tlsproxy::envoy: Support ratelimit configuration [puppet] - https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: Clément Goubert)
[08:42:38] SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11911154 (Marostegui) ssh key verified out of band
[08:43:48] SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11911158 (Marostegui)
[08:48:51] (PS1) Blake: gateway-check.lua: Route some LiftWing endpoints through the REST gateway.
[puppet] - https://gerrit.wikimedia.org/r/1286284 (https://phabricator.wikimedia.org/T422804)
[08:50:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1036 (T419961)', diff saved to https://phabricator.wikimedia.org/P92477 and previous config saved to /var/cache/conftool/dbconfig/20260512-085009-fceratto.json
[08:52:32] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:53:52] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911191 (Marostegui) SSH key verified out of band
[08:55:28] (PS6) CWilliams: data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930)
[08:55:46] (PS4) CWilliams: data.yaml: Adding cwilliams to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930)
[08:56:19] (CR) CI reject: [V:-1] data.yaml: Adding cwilliams to users [puppet] - https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[08:56:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:49] (CR) CI reject: [V:-1] data.yaml: Adding cwilliams to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) (owner: CWilliams)
[09:00:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1036', diff saved to https://phabricator.wikimedia.org/P92478 and previous
config saved to /var/cache/conftool/dbconfig/20260512-090017-fceratto.json [09:05:53] (03CR) 10Atsuko: [C:04-1] "found a bug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [09:10:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1036', diff saved to https://phabricator.wikimedia.org/P92479 and previous config saved to /var/cache/conftool/dbconfig/20260512-091025-fceratto.json [09:11:57] (03PS5) 10CWilliams: data.yaml: Adding cwilliams to ops [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) [09:12:55] (03CR) 10CI reject: [V:04-1] data.yaml: Adding cwilliams to ops [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [09:13:24] (03CR) 10CWilliams: "Updating reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [09:15:58] (03CR) 10Effie Mouzeli: [C:03+1] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. 
[puppet] - 10https://gerrit.wikimedia.org/r/1286284 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [09:17:14] (03PS1) 10JMeybohm: tlsproxy::envoy: Various envoy config syntax fixes [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) [09:17:44] (03PS7) 10CWilliams: data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) [09:18:35] (03PS6) 10CWilliams: data.yaml: Adding cwilliams to ops [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) [09:20:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1036 (T419961)', diff saved to https://phabricator.wikimedia.org/P92480 and previous config saved to /var/cache/conftool/dbconfig/20260512-092034-fceratto.json [09:20:44] (03CR) 10Blake: [C:03+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - 10https://gerrit.wikimedia.org/r/1286284 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [09:20:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911344 (10Marostegui) [09:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 19h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [09:25:35] (03PS1) 10Marostegui: data.yaml: Add catherinekelsey to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1286291 (https://phabricator.wikimedia.org/T425565) [09:28:00] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [09:35:33] (03PS2) 10JMeybohm: tlsproxy::envoy: Various envoy config 
syntax fixes [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) [09:42:51] (03PS1) 10Kosta Harlan: Update UserEntitySerializer callers [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286295 [09:43:06] jouncebot: nowandnext [09:43:07] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T0800) [09:43:07] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1000) [09:43:16] andre: shall we sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1286295 now? [09:43:33] (03PS2) 10Kosta Harlan: Update UserEntitySerializer callers [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286295 (https://phabricator.wikimedia.org/T426026) [09:43:40] kostajh: I think you can, this timeslot should be free [09:43:44] ok [09:44:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286295 (https://phabricator.wikimedia.org/T426026) (owner: 10Kosta Harlan) [09:44:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:59] (03Merged) 10jenkins-bot: Update UserEntitySerializer callers [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286295 (https://phabricator.wikimedia.org/T426026) (owner: 10Kosta Harlan) [09:48:22] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11911540 (10ayounsi) [09:48:28] !log 
kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1286295|Update UserEntitySerializer callers (T426026)]] [09:48:32] T426026: ArgumentCountError: Too few arguments to function MediaWiki\Extension\EventBus\Serializers\MediaWiki\UserEntitySerializer::__construct(), 3 passed in /srv/mediawiki/php-1.47.0-wmf.2/extensions/WikimediaEvents/ - https://phabricator.wikimedia.org/T426026 [09:48:39] (03PS1) 10Ayounsi: Add network depool strategy to some DB roles [puppet] - 10https://gerrit.wikimedia.org/r/1286296 (https://phabricator.wikimedia.org/T425334) [09:50:21] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1286295|Update UserEntitySerializer callers (T426026)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:51:59] !log kharlan@deploy1003 kharlan: Continuing with deployment [09:56:11] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286295|Update UserEntitySerializer callers (T426026)]] (duration: 07m 43s) [09:56:15] T426026: ArgumentCountError: Too few arguments to function MediaWiki\Extension\EventBus\Serializers\MediaWiki\UserEntitySerializer::__construct(), 3 passed in /srv/mediawiki/php-1.47.0-wmf.2/extensions/WikimediaEvents/ - https://phabricator.wikimedia.org/T426026 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1000) [10:02:09] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:03:30] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: add rdb2011 and rdb2012 IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:06:43] (03Merged) 10jenkins-bot: mediawiki-common: add rdb2011 and rdb2012 IPs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1285336 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:07:21] kostajh: please give me a shout when you are done, there are some infra changes I need to deploy during this window [10:07:37] effie: I’m done [10:07:41] grand tx ! [10:08:45] (03PS3) 10JMeybohm: tlsproxy::envoy: Various envoy config syntax fixes [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) [10:08:45] (03PS1) 10JMeybohm: Disable rate limiting on ms-fe2009 [puppet] - 10https://gerrit.wikimedia.org/r/1286301 (https://phabricator.wikimedia.org/T414440) [10:09:09] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:09:09] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:13:48] (03CR) 10Blake: [C:03+1] "blake@deploy1003:~$ host mc1055" [puppet] - 10https://gerrit.wikimedia.org/r/1285785 (https://phabricator.wikimedia.org/T412255) (owner: 10Effie Mouzeli) [10:15:49] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter_wancache: replace mc1037 with mc1055 [puppet] - 10https://gerrit.wikimedia.org/r/1285785 (https://phabricator.wikimedia.org/T412255) (owner: 10Effie Mouzeli) [10:17:57] (03CR) 10JMeybohm: [C:03+2] tlsproxy::envoy: Various envoy config syntax fixes [puppet] - 10https://gerrit.wikimedia.org/r/1286290 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:18:03] (03CR) 10JMeybohm: [C:03+2] Disable rate limiting on ms-fe2009 [puppet] - 10https://gerrit.wikimedia.org/r/1286301 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:23:23] (03PS1) 10Volans: wmcs: add temporary logging to NFS tracing [puppet] - 10https://gerrit.wikimedia.org/r/1286305 [10:27:18] (03PS1) 10JMeybohm: Enable media 
rate limiting on ms-fe2010 [puppet] - 10https://gerrit.wikimedia.org/r/1286306 (https://phabricator.wikimedia.org/T414440) [10:31:24] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add new networks ibgp peering - cmooney@cumin1003" [10:31:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add new networks ibgp peering - cmooney@cumin1003" [10:39:02] (03PS1) 10Kosta Harlan: Special:UserLogin: Instrument no-JS form submissions [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286309 (https://phabricator.wikimedia.org/T425631) [10:39:39] (03PS1) 10Ayounsi: GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) [10:43:01] (03CR) 10Blake: [C:03+2] k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) (owner: 10Blake) [10:43:17] (03PS1) 10Ayounsi: Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) [10:43:27] (03CR) 10CI reject: [V:04-1] Special:UserLogin: Instrument no-JS form submissions [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286309 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [10:51:01] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:52:19] (03PS1) 10Santiago Faci: Test 
Kitchen UI: Deploy v1.3.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286312 (https://phabricator.wikimedia.org/T393434) [10:53:30] (03PS9) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [10:56:52] (03CR) 10CI reject: [V:04-1] GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [11:03:03] is anyone using this deployment window ? [11:10:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11911764 (10Marostegui) Waiting for patch review - also no specific group approval is needed per this c... [11:11:30] (03CR) 10Marostegui: [C:03+1] "This looks and can go after: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1285368 (which still requires manager approval - being h" [puppet] - 10https://gerrit.wikimedia.org/r/1285369 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [11:12:23] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1286305 (owner: 10Volans) [11:13:37] (03CR) 10Majavah: [C:03+2] P:openstack: neutron: Set MTU on cloudnet eqiad1 VLAN interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1285759 (https://phabricator.wikimedia.org/T425674) (owner: 10Majavah) [11:16:33] (03CR) 10Ladsgroup: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:18:22] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 
10Federico Ceratto) [11:19:03] (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Stop writing and drop old project column [puppet] - 10https://gerrit.wikimedia.org/r/1282944 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [11:22:46] (03CR) 10Ladsgroup: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:25:26] (03PS1) 10Majavah: openstack: encapi: Make new project column NOT NULL [puppet] - 10https://gerrit.wikimedia.org/r/1286315 (https://phabricator.wikimedia.org/T416588) [11:25:36] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:29:30] (03CR) 10Majavah: [C:03+2] openstack: encapi: Make new project column NOT NULL [puppet] - 10https://gerrit.wikimedia.org/r/1286315 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [11:29:45] jouncebot: nowandnext [11:29:45] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [11:29:45] In 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1200) [11:29:51] (03CR) 10Ladsgroup: [C:03+2] Disable FR on wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) (owner: 10Ladsgroup) [11:34:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) (owner: 10Ladsgroup) [11:40:01] Amir1: I have something to backport when you’re done [11:40:25] sure! 
[11:42:09] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for VisualEditor and MobileFrontend mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286318 (https://phabricator.wikimedia.org/T425940) [11:42:13] jouncebot: nowandnext [11:42:14] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [11:42:14] In 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1200) [11:42:23] kostajh: Same here after you [11:44:11] (03PS3) 10Mszwarc: Periodic jobs: add demote_ineligible_users (and _central_ counterpart) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) [11:44:33] (03CR) 10Mszwarc: [C:04-1] Periodic jobs: add demote_ineligible_users (and _central_ counterpart) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) (owner: 10Mszwarc) [11:45:11] (03CR) 10Ladsgroup: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:45:24] Amir1 Dreamy_Jazz I missed my deployment window, may I deploy after your? [11:45:42] I think it's Amir, Kosta, and then me [11:45:52] So should be fine if the next window doesn't need scap [11:46:04] before me it's jenkins taking its sweet sweet time to merge the patch [11:46:06] I do not need scap tbh, I just do not want to add noise [11:46:09] ... 
or issues [11:46:19] :D [11:46:20] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:47:22] this is a long queue, and only Dreamy_Jazz is culturally equipped to tolerate it [11:47:32] :D [11:47:38] Queues are my life :D [11:47:49] haha [11:48:34] Dreamy_Jazz: since you are the last one, please ping me when:) [11:51:28] Sure [11:54:41] actually, are we sure it's the merge queue? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1284683 that doesn't look like it's waiting on anything [11:54:52] (03CR) 10Ladsgroup: "again?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) (owner: 10Ladsgroup) [11:54:55] (03CR) 10Ladsgroup: [C:03+2] Disable FR on wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) (owner: 10Ladsgroup) [11:56:26] tests pass but it's not merging [11:56:36] (03PS2) 10Ladsgroup: Disable FR on wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) [11:56:43] sighhhh [11:56:51] it becomes an empty commit [11:57:00] lolol [11:57:37] (03Abandoned) 10Ladsgroup: Disable FR on wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) (owner: 10Ladsgroup) [11:57:48] kostajh: feel free to move forward [11:59:02] ok [11:59:33] (03CR) 10A smart kitten: codesearch: create script/timer to delete zombie lock files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [11:59:38] (03CR) 10Kosta Harlan: "recheck" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286309 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) 
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1200) [12:00:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286309 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [12:02:01] (03Merged) 10jenkins-bot: Special:UserLogin: Instrument no-JS form submissions [extensions/WikimediaEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286309 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [12:02:35] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1286309|Special:UserLogin: Instrument no-JS form submissions (T425631)]] [12:02:37] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for DiscussionTools on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286322 (https://phabricator.wikimedia.org/T426039) [12:02:38] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [12:03:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285913 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [12:03:35] (03CR) 10CI reject: [V:04-1] hCaptcha: Enable for DiscussionTools on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286322 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [12:03:43] (03PS1) 10Dreamy Jazz: Show CAPTCHA if required for all edits before first edit attempt [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) [12:04:16] (03CR) 10JMeybohm: [C:03+2] Enable media rate limiting on ms-fe2010 [puppet] - 
10https://gerrit.wikimedia.org/r/1286306 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [12:04:28] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1286309|Special:UserLogin: Instrument no-JS form submissions (T425631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:05:43] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [12:06:04] !log kharlan@deploy1003 kharlan: Continuing with deployment [12:06:22] (03PS2) 10Dreamy Jazz: hCaptcha: Enable for DiscussionTools on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286322 (https://phabricator.wikimedia.org/T426039) [12:08:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:51] (03PS1) 10Anne Tomasevich: WelcomeSurvey: Respect returnTo for campaigns skipping the survey [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) [12:09:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286327 (https://phabricator.wikimedia.org/T422169) (owner: 10Anne Tomasevich) [12:09:38] (03Merged) 10jenkins-bot: Show CAPTCHA if required for all edits before first edit attempt [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 
(https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [12:09:55] (03PS1) 10Dreamy Jazz: Make DiscussionTools not show hCaptcha initially unless configured [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286328 (https://phabricator.wikimedia.org/T425955) [12:10:08] (03CR) 10Dreamy Jazz: [C:03+2] Make DiscussionTools not show hCaptcha initially unless configured [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286328 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [12:10:19] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286309|Special:UserLogin: Instrument no-JS form submissions (T425631)]] (duration: 07m 45s) [12:10:23] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [12:11:32] kostajh: you done? [12:13:17] yes [12:13:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286328 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [12:13:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286322 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [12:13:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286318 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [12:14:17] (03Merged) 10jenkins-bot: Make DiscussionTools not show hCaptcha initially unless configured [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286328 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [12:15:20] (03Merged) 10jenkins-bot: hCaptcha: Enable for VisualEditor and MobileFrontend 
mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286318 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [12:15:23] (03Merged) 10jenkins-bot: hCaptcha: Enable for DiscussionTools on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286322 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [12:15:52] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1286328|Make DiscussionTools not show hCaptcha initially unless configured (T425955)]], [[gerrit:1286324|Show CAPTCHA if required for all edits before first edit attempt (T425955)]], [[gerrit:1286322|hCaptcha: Enable for DiscussionTools on testwiki (T426039)]], [[gerrit:1286318|hCaptcha: Enable for VisualEditor and MobileFrontend mediawikiwiki (T425 [12:15:52] 940)]] [12:15:59] T425955: DiscussionTools hCaptcha: Show hCaptcha widget before first reply attempt - https://phabricator.wikimedia.org/T425955 [12:15:59] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [12:16:00] T425: Create an outline of QA/Browser test workshops to give - https://phabricator.wikimedia.org/T425 [12:16:06] jouncebot: nowandnext [12:16:06] For the next 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1200) [12:16:06] In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1300) [12:16:11] Poor T 425 :D [12:16:48] dcausse: we have a lovely spot in our queue for you [12:17:30] it has been a little crowded today [12:17:43] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1286328|Make DiscussionTools not show hCaptcha initially unless configured (T425955)]], [[gerrit:1286324|Show CAPTCHA if required for all edits before first edit attempt (T425955)]], [[gerrit:1286322|hCaptcha: Enable for DiscussionTools on testwiki (T426039)]], 
[[gerrit:1286318|hCaptcha: Enable for VisualEditor and MobileFrontend mediawikiwiki (T425940)]] synced [12:17:43] to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:17:47] (03CR) 10Ladsgroup: [C:03+1] "Should I deploy this?" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [12:17:49] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [12:17:55] (03CR) 10Btullis: [C:03+2] Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [12:18:20] Testing... [12:18:42] effie_: np! :) [12:18:49] haha [12:20:20] (03CR) 10Ladsgroup: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [12:20:56] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [12:21:02] Testing complete (and worked) [12:21:36] this is my queue then ? 
[12:21:41] (pun intended) [12:21:47] :D [12:21:55] cheers [12:22:03] Yeah, just need to wait for scap to finish [12:23:55] Everything wikimedia.org has stopped loading for me [12:24:01] Including https://spiderpig.wikimedia.org/jobs/1967 [12:24:05] (03PS1) 10Sbisson: ArticleGuidance: set sparql endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) [12:24:41] Fixed by switching to mobile tether [12:25:04] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286328|Make DiscussionTools not show hCaptcha initially unless configured (T425955)]], [[gerrit:1286324|Show CAPTCHA if required for all edits before first edit attempt (T425955)]], [[gerrit:1286322|hCaptcha: Enable for DiscussionTools on testwiki (T426039)]], [[gerrit:1286318|hCaptcha: Enable for VisualEditor and MobileFrontend mediawikiwiki (T42 [12:25:04] 5940)]] (duration: 09m 12s) [12:25:10] T425955: DiscussionTools hCaptcha: Show hCaptcha widget before first reply attempt - https://phabricator.wikimedia.org/T425955 [12:25:11] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [12:25:15] effie_: Over to you [12:26:01] <3 [12:26:11] (03CR) 10Ladsgroup: "I think we are ready to merge this. Shall we go ahead and close the hypothesis after merge and deploy of this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [12:26:37] (03PS1) 10Audrey Penven: Keep all long, non-wrapping values inside parent element [extensions/Wikibase] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286336 (https://phabricator.wikimedia.org/T425176) [12:26:44] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:26:54] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:27:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [12:28:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Wikibase] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286336 (https://phabricator.wikimedia.org/T425176) (owner: 10Audrey Penven) [12:28:59] (03PS1) 10Majavah: P:openstack: nova: Set MTU on flat VLAN interface in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1286337 (https://phabricator.wikimedia.org/T425674) [12:30:53] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1286337 (https://phabricator.wikimedia.org/T425674) (owner: 10Majavah) [12:31:04] (03CR) 10Volans: [C:03+2] wmcs: add temporary logging to NFS tracing [puppet] - 10https://gerrit.wikimedia.org/r/1286305 (owner: 10Volans) [12:33:16] (03PS4) 10Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) 
[12:33:42] (03CR) 10Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [12:34:02] (03CR) 10Atsuko: translate: add opensearch-ttmserver-test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [12:35:40] (03CR) 10Lucas Werkmeister (WMDE): ArticleGuidance: set sparql endpoint (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [12:37:35] (03CR) 10DCausse: "Testing this patch during a scap deploy might be a challenging." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [12:38:18] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [12:40:49] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:42:47] (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [12:43:00] dcausse: need 2 mins to vetify [12:43:14] verify* [12:43:18] (03PS1) 10Ottomata: page_change - add revision.revert info [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286341 (https://phabricator.wikimedia.org/T423583) [12:43:21] effie_: actually I might not be ready so no worries! :) [12:44:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286341 (https://phabricator.wikimedia.org/T423583) (owner: 10Ottomata) [12:49:25] dcausse: I am done [12:49:39] thanks! 
:) [12:56:12] (03PS1) 10Effie Mouzeli: regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) [12:56:38] (03PS2) 10Effie Mouzeli: regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) [12:56:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:43] (03CR) 10CI reject: [V:04-1] regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [12:56:57] (03CR) 10DCausse: [C:03+1] "lgtm," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [12:57:11] (03CR) 10CI reject: [V:04-1] regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [12:59:47] (03PS3) 10Effie Mouzeli: regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1300). nyaa~ [13:00:05] stephanebisson, yerdua_wmde, and ottomata: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] o/ [13:00:17] nyaa~ [13:00:24] (03PS2) 10Sbisson: ArticleGuidance: set sparql endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) [13:00:52] Lucas_WMDE thanks for checking my patch. What do you think of the new PS? [13:00:57] looking [13:01:17] does it make sense to query the real WDQS from Beta? [13:01:33] (I don’t know what this feature is, but generally speaking Beta has different item IDs…) [13:01:37] (03PS4) 10Dbrant: docroot: Add "get_login_creds" permission to Android app. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) [13:01:56] It's working well querying the real wdqs in beta [13:02:04] o/ [13:02:47] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:03:43] (03PS5) 10Dbrant: docroot: Add "get_login_creds" permission to Android app. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) [13:04:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:31] stephanebisson: okay [13:04:39] are you going to deploy yourself or do you need a deployer? 
[13:04:46] I'll do it [13:04:47] ok [13:04:57] (and hopefully the code sends a good user agent to WDQS and all that) [13:05:10] (03CR) 10Lucas Werkmeister (WMDE): ArticleGuidance: set sparql endpoint (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:05:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:06:18] (03CR) 10Atsuko: "Committed to PrivateSettings.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:06:20] Yes, there's a new and improved UA with contact info [13:06:31] (03Merged) 10jenkins-bot: ArticleGuidance: set sparql endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286334 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:06:54] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1286334|ArticleGuidance: set sparql endpoint (T425389)]] [13:06:58] T425389: Display the outline name that applies when listing Wikidata items in Article guidance - https://phabricator.wikimedia.org/T425389 [13:08:47] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1286334|ArticleGuidance: set sparql endpoint (T425389)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:52] (03CR) 10Atsuko: translate: add opensearch-ttmserver-test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:09:37] (03PS6) 10Dbrant: docroot: Add "get_login_creds" permission to Android app. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) [13:09:54] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:10:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:12:41] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Wikibase] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286336 (https://phabricator.wikimedia.org/T425176) (owner: 10Audrey Penven) [13:12:59] ottomata: how risky is your backport? wondering if we should do it together with yerdua_wmde’s [13:13:06] (just to save a bit of time) [13:14:08] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286334|ArticleGuidance: set sparql endpoint (T425389)]] (duration: 07m 13s) [13:14:11] T425389: Display the outline name that applies when listing Wikidata items in Article guidance - https://phabricator.wikimedia.org/T425389 [13:15:01] hi! [13:15:07] Lucas_WMDE: should not be risky! [13:15:13] i can test on test mwdebug easily [13:15:33] ok then let’s do them together [13:15:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286336 (https://phabricator.wikimedia.org/T425176) (owner: 10Audrey Penven) [13:15:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286341 (https://phabricator.wikimedia.org/T423583) (owner: 10Ottomata) [13:16:07] (AFAICT the risk of the Wikibase change causing server-side issues is basically zero) [13:16:26] that^ sounds right. 
it's a css change [13:16:53] Lucas_WMDE: I have tested in beta already. so when its on mwdebug lemme know and i'll try it out [13:16:53] (03Merged) 10jenkins-bot: Keep all long, non-wrapping values inside parent element [extensions/Wikibase] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286336 (https://phabricator.wikimedia.org/T425176) (owner: 10Audrey Penven) [13:17:00] ack [13:17:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2006.codfw.wmnet with reason: Reboot [13:18:19] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:20:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2233.codfw.wmnet with reason: Reboot [13:20:57] (03PS1) 10Effie Mouzeli: site.pp add more memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286356 (https://phabricator.wikimedia.org/T418263) [13:21:48] (03CR) 10Blake: [C:03+1] regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:21:53] ACKNOWLEDGEMENT - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-05-12 00:38:14 is 1.9 MiB, but the previous one was 3 MiB, a change of -31.1 % Jcrespo can be ingnored - The acknowledgement expires at: 2026-05-19 13:21:33. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:21:53] ACKNOWLEDGEMENT - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-05-12 00:39:06 is 1.9 MiB, but the previous one was 3 MiB, a change of -31.1 % Jcrespo can be ingnored - The acknowledgement expires at: 2026-05-19 13:21:33. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:22:35] PROBLEM - MariaDB Replica IO: m2 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2233.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2233.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:22:36] (03PS4) 10Effie Mouzeli: regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) [13:22:38] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:22:47] (03PS1) 10Blake: gateway-check.lua: Route some LiftWing endpoints through the REST gateway. [puppet] - 10https://gerrit.wikimedia.org/r/1286298 (https://phabricator.wikimedia.org/T422804) [13:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -8d 23h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [13:25:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:08] (03CR) 10Blake: site.pp add more memcached servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286356 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:26:17] the EventBus gate-and-submit should be done any second now… [13:26:24] (03CR) 10Effie Mouzeli: "This build failed because something is wrong with jenkins' facts 
updates" [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:27:05] (03Merged) 10jenkins-bot: page_change - add revision.revert info [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286341 (https://phabricator.wikimedia.org/T423583) (owner: 10Ottomata) [13:27:20] okay! [13:27:22] i'm here! [13:27:34] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286341|page_change - add revision.revert info]] [13:27:36] not yet, it’s just starting deployment now :P [13:27:38] T425176: [MEX] M5 - 🐛Long quantity values break boundaries - https://phabricator.wikimedia.org/T425176 [13:28:35] RECOVERY - MariaDB Replica IO: m2 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:28:54] (03CR) 10Effie Mouzeli: site.pp add more memcached servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286356 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:29:25] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde, otto: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286341|page_change - add revision.revert info]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:29:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [13:29:46] yerdua_wmde, ottomata: please test :) [13:29:51] (03CR) 10Blake: [C:03+1] site.pp add more memcached servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286356 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:29:57] (03PS1) 10JMeybohm: Disable media rate limiting on ms-fe210 [puppet] - 10https://gerrit.wikimedia.org/r/1286358 (https://phabricator.wikimedia.org/T414440) [13:30:00] (almost typoed ::). the spider emoticon) [13:30:41] (03PS2) 10JMeybohm: Disable media rate limiting on ms-fe2010 [puppet] - 10https://gerrit.wikimedia.org/r/1286358 (https://phabricator.wikimedia.org/T414440) [13:30:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:36] (03CR) 10Effie Mouzeli: [C:03+1] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. 
[puppet] - 10https://gerrit.wikimedia.org/r/1286298 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [13:32:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [13:32:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11912612 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye [13:32:07] testing [13:32:19] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:33:29] test.wikidata.org seems like it didn't get my fix. though it's unclear if it's a caching issue... checking in an incognito window [13:33:40] (03CR) 10Effie Mouzeli: [C:03+2] site.pp add more memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286356 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:33:51] https://test.wikidata.org/wiki/Special:Version says it’s on wmf.2 at least… [13:33:52] (03CR) 10Blake: [C:03+2] gateway-check.lua: Route some LiftWing endpoints through the REST gateway. 
[puppet] - 10https://gerrit.wikimedia.org/r/1286298 (https://phabricator.wikimedia.org/T422804) (owner: 10Blake) [13:35:01] (03CR) 10Effie Mouzeli: [C:03+2] regex.yaml: disable extstore for new memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286346 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [13:36:50] yerdua_wmde: I get the overflow treatment when turning on WikimediaDebug and reloading, I think [13:37:04] (I probably did a Ctrl+F5 force reload but tbh I don’t 100% remember) [13:37:28] ah, there it is [13:37:37] (03CR) 10Atsuko: [C:03+2] translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:37:52] works for me now [13:37:52] (03PS1) 10Sbisson: Add configurable user-agent and sparql endpoint url [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) [13:38:00] okay, just waiting for ottomata to confirm then [13:38:03] thanks for testing :) [13:38:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:39:00] (03Merged) 10jenkins-bot: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:39:13] uhhhhh [13:39:21] atsukoito: why are you merging config changes while I’m deploying? [13:39:34] trying, i am still looking for my event after i make an edit. i'm doing something stupid i think [13:40:09] Lucas_WMDE: I did by mistake :< [13:40:38] Lucas_WMDE: should I revert? 
[13:41:05] probably yeah [13:41:29] (03CR) 10JMeybohm: [C:03+2] Disable media rate limiting on ms-fe2010 [puppet] - 10https://gerrit.wikimedia.org/r/1286358 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [13:41:30] ottomata: okay, thanks for the info. good luck [13:44:21] Lucas_WMDE: is testwiki fully on my patch? or is it just on mwdebug? [13:44:48] just mwdebug [13:44:54] k... [13:45:01] if jobs are involved in event delivery then that might interfere with testing? [13:45:19] no, no jobs, DeferredUpdate should post to eventgate after edit [13:45:24] hm ok [13:45:30] there are plenty of testwiki events flowing through [13:45:33] i just don't see my edit [13:45:39] i'm sure i'm doing something stupid [13:46:03] (03PS1) 10Atsuko: Revert "translate: add opensearch-ttmserver-test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286364 [13:46:29] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1286364 can you please merge the revert? [13:46:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Revert "translate: add opensearch-ttmserver-test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286364 (owner: 10Atsuko) [13:47:08] done, thanks [13:47:18] oh i think i know! 
I always forget that eventgate-main needs a rolling restart after new schema versions [13:47:19] doing that now [13:47:25] sorry for the mess [13:47:27] ok [13:47:54] ottomata: and hopefully that’s compatible with the old code that’s still on non-mwdebug servers [13:47:59] yes it is [13:48:04] (“servers” in scare quotes because k8s but you know what I mean :P) [13:48:13] ok [13:48:25] !log roll restart eventgate main to pick up mediawiki/page/change/1.4.0 schema version for T423583 [13:48:26] (03CR) 10CI reject: [V:04-1] Revert "translate: add opensearch-ttmserver-test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286364 (owner: 10Atsuko) [13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:28] T423583: mediawiki.page_change.v1 event - Add revision revert details - https://phabricator.wikimedia.org/T423583 [13:48:47] (03Merged) 10jenkins-bot: Revert "translate: add opensearch-ttmserver-test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286364 (owner: 10Atsuko) [13:48:51] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM! 
All 47 wikis are in the list + testwiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283758 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [13:48:55] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync [13:49:03] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [13:49:07] sukhe@cumin1003 reimage (PID 3036203) is awaiting input [13:49:18] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [13:49:42] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [13:49:52] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [13:49:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11912751 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*... [13:49:57] (03CR) 10Andrew Bogott: [C:03+1] "Looks right to me but a final +1 from cathal might be good" [puppet] - 10https://gerrit.wikimedia.org/r/1286337 (https://phabricator.wikimedia.org/T425674) (owner: 10Majavah) [13:50:07] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [13:50:15] hi. can I add another change to the deployment now? [13:50:26] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:50:33] Neriah: you can, but I think it’s doubtful if there’ll be enough time for it [13:51:09] I think it could go together with something else. 
[13:51:25] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1285482 [13:51:27] the only other change is by atsukoito and I feel like that one needs to go on its own [13:51:32] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1270.eqiad.wmnet with OS bookworm [13:51:37] (because it involves PrivateSettings, though I haven’t looked into the details) [13:51:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11912755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm [13:51:52] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:52:18] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: nova: Set MTU on flat VLAN interface in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1286337 (https://phabricator.wikimedia.org/T425674) (owner: 10Majavah) [13:52:45] Lucas_WMDE: let's pull my change out from the queue for today [13:52:54] (03PS1) 10Eevans: sessionstore: Upgrade prod to v1.0.19 (Debian Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286367 (https://phabricator.wikimedia.org/T425308) [13:52:58] okay [13:54:13] !log vriley@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:54:44] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:56:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11912783 (10VRiley-WMF) [13:57:11] !log vriley@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:57:35] Reedy: i need to sync deploy1003:/srv/wikimedia-staging/private/PrivateSettings.php last change to deployment-deploy04 (either the live values or just an empty `$wgOpensearchCredentials = []; //@see https://phabricator.wikimedia.org/T425377 `, can you please help? 
(moving from releng) [13:57:38] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:57:57] So deployment doesn't get a copy of prod's passwords [13:58:03] They're usually placeholders/different values etc [13:58:09] But yeah, I can add the empty array [13:58:52] Lucas_WMDE: something is fishy. I think my patch is fine but I am having trouble testing it. I have seen some of my edits come through, but it almost looks like ...minutes later? most events seem fine, but testwiki events come through much later. [13:58:52] Hm [13:59:00] hm [13:59:01] they seem backlogged somehow? which is strange? [13:59:03] ah-h, then the placeholder is fine, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1283711/10/private/readme.php [13:59:15] let me try mwdebug with a different wiki [13:59:17] ah-h, then the placeholder is fine, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1283711/10/private/readme.php ^^ Reedy [13:59:21] ok [13:59:32] Lucas_WMDE: this should work for any group0 wiki on mwdebug yes? [13:59:40] I think so yeah [13:59:51] you can check the wiki’s Special:Version page to make sure ^^ [13:59:54] what's a good one to test an edit on... [14:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1400) [14:00:05] test2wiki exists ^^ [14:00:07] or testwikidatawiki [14:00:14] re Test Kitchen: I’m still deploying, sorry [14:00:50] or mediawiki.org [14:01:10] atsukoito: Empty array added.. 
CI should sync it in its schedule "soon" [14:01:11] (03CR) 10Eevans: [C:03+2] sessionstore: Upgrade prod to v1.0.19 (Debian Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286367 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [14:01:24] yea mw.org will try [14:01:29] (03PS2) 10Dreamrimmer: Allow svwiki bureaucrats to remove sysop rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285482 (https://phabricator.wikimedia.org/T425806) [14:01:39] i will re-instate the diff for tomorrow's release then [14:01:48] alright, good luck atsukoito [14:02:03] Neriah: we won’t have time for another config change in this window, sorry [14:02:04] thanks Lucas_WMDE Reedy [14:02:18] no problem [14:02:22] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1272] - vriley@cumin1003" [14:02:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1272] - vriley@cumin1003" [14:02:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:41] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [14:03:02] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1272 [14:03:05] (Lucas’ pro tip for abbreviating domain names: if you’re abbreviating the second-level domain, always abbreviate the top-level domain as well, e.g. mw.o – otherwise who knows who you’re linking to ;)) [14:03:14] (fortunately mw[.]org just seems to be vacant at the moment) [14:03:35] (03Merged) 10jenkins-bot: sessionstore: Upgrade prod to v1.0.19 (Debian Trixie) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286367 (https://phabricator.wikimedia.org/T425308) (owner: 10Eevans) [14:03:52] Lucas_WMDE: I'm not sure what is going on. something is weird and i'm not successfully testing. 
I don't think it is related to my patch. [14:03:52] I need to do other things so let's revert my patch for now? [14:03:56] i'll have to try again later. [14:03:59] (03PS1) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286371 (https://phabricator.wikimedia.org/T425377) [14:04:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1272 [14:04:51] ottomata: okay :/ [14:04:56] but then I need another deploy for the Wikibase change [14:04:57] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1272.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:05:02] ahhh sorry. [14:05:04] i mean, it should work [14:05:04] but yeah let’s be on the safe side [14:05:11] (though I am tempted to just let it roll out anyway) [14:05:13] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:05:17] electrician is here and needs to shut off electricity for a minute!!! [14:05:19] so i have to go afk! [14:05:22] ok! [14:05:25] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde, otto: Rolling back deployment [14:05:26] sorry about that [14:05:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:05:34] * Lucas_WMDE continues to be deploying, nobody else scap please [14:05:49] okay, so scap is “Rolling back k8s deployment for stage testservers” [14:05:55] I assume scap/SpiderPig won’t automatically Gerrit revert [14:05:58] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1271 [14:06:00] so I need to manually revert the EventBus backport [14:06:04] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[14:06:09] and I’ll probably force-merge that just to skip past the gate-and-submit [14:06:23] [a] Acknowledge and release the deployment lock [14:06:23] The backported change was undeployed by rollback, but it still exists in the codebase. [14:06:23] You must merge a fix or revert commit before allowing further backports to proceed. (default: [a]): [14:06:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:26] ok good to know, thanks scap [14:06:45] (03CR) 10DCausse: [C:03+1] translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286371 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [14:06:47] (03PS1) 10Lucas Werkmeister (WMDE): Revert "page_change - add revision.revert info" [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286372 [14:07:01] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "force-merging this, it’s a clean revert to a known-good state" [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286372 (owner: 10Lucas Werkmeister (WMDE)) [14:07:06] sorry for the extra work lucas, I appreciate your help [14:07:10] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286341|page_change - add revision.revert info]] (duration: 39m 36s) [14:07:13] T425176: [MEX] M5 - 🐛Long quantity values break boundaries - https://phabricator.wikimedia.org/T425176 [14:07:16] something is weird here, i'll figure it out before I schedule this next time. [14:07:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:07:30] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [14:07:35] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [14:07:37] np [14:07:37] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [14:07:41] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [14:07:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:07:53] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host mc1056.eqiad.wmnet with OS bullseye [14:07:56] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1057.eqiad.wmnet with OS bullseye [14:07:58] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1058.eqiad.wmnet with OS bullseye [14:08:00] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1059.eqiad.wmnet with OS bullseye [14:08:03] vriley@cumin1003 provision (PID 3059879) is awaiting input [14:08:10] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply [14:08:19] hmph, the output in https://spiderpig.wikimedia.org/jobs/1970 is only partially useful [14:08:28] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [14:08:29] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286372|Revert "page_change - add revision.revert info"]] [14:08:34] it complained about the change by atsukoito, as expected, and offered to show me the diff [14:08:35] (03PS1) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 [14:08:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11912829 (10ssingh) @VRiley-WMF: As John correctly pointed out, 
this is booting with UEFI enabled now. Is that expected and the default for all hosts now? If that is the case, we can... [14:09:06] which I expected would be empty. but instead it showed me the `git log --patch`, i.e. all the changes of the original commit and then the revert of those changes in the revert commit [14:09:11] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1272.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:09:29] but probably not worth a bug report. whatever [14:09:59] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1271 [14:10:24] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286372|Revert "page_change - add revision.revert info"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:10:32] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1271.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:11:09] !log lucaswerkmeister-wmde@deploy1003 audreypenven, lucaswerkmeister-wmde: Continuing with deployment [14:15:32] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286336|Keep all long, non-wrapping values inside parent element (T425176)]], [[gerrit:1286372|Revert "page_change - add revision.revert info"]] (duration: 07m 02s) [14:15:36] T425176: [MEX] M5 - 🐛Long quantity values break boundaries - https://phabricator.wikimedia.org/T425176 [14:15:41] !log UTC afternoon backport+config window done [14:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:42] (03CR) 10VadymTS1: [C:03+1] "I can deploy this change today if you (DreamRimmer) don't mind" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285482 (https://phabricator.wikimedia.org/T425806) (owner: 10Dreamrimmer) [14:15:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285482 (https://phabricator.wikimedia.org/T425806) (owner: 10Dreamrimmer) [14:15:53] over to Test Kitchen UI Deployment Window, sorry for the delay [14:17:07] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [14:17:24] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [14:19:59] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [14:20:04] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [14:20:05] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1057.eqiad.wmnet with reason: host 
reimage [14:20:30] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [14:21:04] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from wdqs1028 to dse-k8s-wdqs-test1001 [14:21:08] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:22:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1271.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:22:29] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:24:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [14:26:30] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming wdqs1028 to dse-k8s-wdqs-test1001 - btullis@cumin1003" [14:26:30] (03PS1) 10Jelto: miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286379 (https://phabricator.wikimedia.org/T414405) [14:26:31] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from wdqs2009 to dse-k8s-wdqs-test2001 [14:26:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming wdqs1028 to dse-k8s-wdqs-test1001 - btullis@cumin1003" [14:26:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:26:47] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-wdqs-test1001 on all recursors [14:26:49] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:26:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-wdqs-test1001 on all recursors [14:26:52] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs-test1001 [14:26:59] (03CR) 10Dzahn: 
[C:03+1] data.yaml: Add catherinekelsey to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1286291 (https://phabricator.wikimedia.org/T425565) (owner: 10Marostegui) [14:27:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs-test1001 [14:28:14] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [14:28:18] (03CR) 10Dzahn: "I would expect then we also have the make the zuul user the owner, not root." [puppet] - 10https://gerrit.wikimedia.org/r/1285923 (owner: 10Dduvall) [14:28:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from wdqs1028 to dse-k8s-wdqs-test1001 [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1430) [14:30:57] PROBLEM - Memcached on mc1057 is CRITICAL: connect to address 10.64.0.197 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [14:31:21] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.29 ms [14:31:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming wdqs2009 to dse-k8s-wdqs-test2001 - btullis@cumin1003" [14:32:43] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [14:32:49] (03CR) 10Jelto: [C:03+2] miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286379 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:33:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [14:33:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Renaming wdqs2009 to dse-k8s-wdqs-test2001 - btullis@cumin1003" [14:33:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:48] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-wdqs-test2001 on all recursors [14:33:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-wdqs-test2001 on all recursors [14:33:52] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs-test2001 [14:34:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs-test2001 [14:34:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: 10Danielyepezgarces) [14:34:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from wdqs2009 to dse-k8s-wdqs-test2001 [14:35:17] (03Merged) 10jenkins-bot: miscweb: bump wmf-navigator images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286379 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:36:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [14:39:40] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1058.eqiad.wmnet with OS bullseye [14:42:58] RECOVERY - Memcached on mc1057 is OK: TCP OK - 0.000 second response time on 10.64.0.197 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [14:43:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1059.eqiad.wmnet with OS bullseye [14:44:41] (03CR) 10Joal: "Some comments. 
Let's also ask @btullis@wikimedia.org for his opinion on changing timers (I always wonder if we need to disable them first " [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [14:44:51] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [14:45:04] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [14:45:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [14:46:30] (03PS1) 10Neriah: wikinews: removing unnecessary settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) [14:47:22] (03CR) 10Neriah: "@Ladsgroup@gmail.com..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [14:47:50] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1056.eqiad.wmnet with OS bullseye [14:49:47] (03CR) 10Federico Ceratto: [C:03+2] "LGTM and I can approve" [puppet] - 10https://gerrit.wikimedia.org/r/1286296 (https://phabricator.wikimedia.org/T425334) (owner: 10Ayounsi) [14:50:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1057.eqiad.wmnet with OS bullseye [14:51:01] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:53:38] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:54:19] !log jelto@deploy1003 
helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:54:39] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [14:55:04] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [14:55:21] (03PS1) 10VadymTS1: Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286390 (https://phabricator.wikimedia.org/T425332) [14:57:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [14:57:13] (03PS1) 10Effie Mouzeli: mcrouter_wancache: add mc1056-mc1059 [puppet] - 10https://gerrit.wikimedia.org/r/1286392 (https://phabricator.wikimedia.org/T418263) [14:57:51] btullis@cumin1003 reimage (PID 3081485) is awaiting input [14:58:16] (03PS2) 10Effie Mouzeli: mcrouter_wancache: add mc1056-mc1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286392 (https://phabricator.wikimedia.org/T418263) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1500). 
[15:01:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286390 (https://phabricator.wikimedia.org/T425332) (owner: 10VadymTS1) [15:02:02] (03PS1) 10Eevans: linked-artifacts: deploy v1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286394 (https://phabricator.wikimedia.org/T425155) [15:02:41] (03CR) 10Blake: [C:03+1] mcrouter_wancache: add mc1056-mc1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286392 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [15:02:52] (03CR) 10Neriah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286390 (https://phabricator.wikimedia.org/T425332) (owner: 10VadymTS1) [15:03:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286371 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [15:04:25] (03CR) 10Eevans: [C:03+2] linked-artifacts: deploy v1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286394 (https://phabricator.wikimedia.org/T425155) (owner: 10Eevans) [15:06:46] (03Merged) 10jenkins-bot: linked-artifacts: deploy v1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286394 (https://phabricator.wikimedia.org/T425155) (owner: 10Eevans) [15:06:57] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286312 (https://phabricator.wikimedia.org/T393434) (owner: 10Santiago Faci) [15:09:20] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286312 (https://phabricator.wikimedia.org/T393434) (owner: 10Santiago Faci) [15:11:46] 
!log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1270.eqiad.wmnet with OS bookworm [15:11:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11913223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm executed with errors: - db1270 (**F... [15:12:09] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [15:12:23] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [15:16:39] (03PS1) 10Volans: wmcs: NFS tracing, skip non existend home/project [puppet] - 10https://gerrit.wikimedia.org/r/1286396 [15:17:21] !log dancy@deploy1003 Installing scap version "4.264.0" for 163 host(s) [15:20:07] (03PS1) 10JMeybohm: tlsproxy::envoy: Fix ratelimit grpc filter [puppet] - 10https://gerrit.wikimedia.org/r/1286399 (https://phabricator.wikimedia.org/T414440) [15:21:50] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286399 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [15:22:09] !log dancy@deploy1003 Installing scap version "4.264.0" for 1 host(s) [15:22:22] (03PS2) 10WMDE-Fisch: testwiki: Disable sub-ref's synthetic list defined refs on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286400 (https://phabricator.wikimedia.org/T425967) [15:23:01] !log dancy@deploy1003 Installation of scap version "4.264.0" completed for 1 hosts [15:23:16] (03CR) 10FNegri: [C:03+1] wmcs: NFS tracing, skip non existend home/project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286396 (owner: 10Volans) [15:23:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1286400 (https://phabricator.wikimedia.org/T425967) (owner: 10WMDE-Fisch) [15:23:55] !log dancy@deploy1003 Installing scap version "4.264.0" for 1 host(s) [15:24:07] (03CR) 10JMeybohm: [C:03+2] tlsproxy::envoy: Fix ratelimit grpc filter [puppet] - 10https://gerrit.wikimedia.org/r/1286399 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [15:24:46] !log dancy@deploy1003 Installation of scap version "4.264.0" completed for 1 hosts [15:25:20] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [15:25:50] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:26:05] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1270.eqiad.wmnet with OS bookworm [15:26:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11913304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm [15:26:48] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1272.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:27:02] (03CR) 10Ladsgroup: wikinews: removing unnecessary settings (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:28:38] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [15:29:40] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:30:19] (03CR) 10A smart kitten: wikinews: removing unnecessary settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:30:30] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:30:44] !log jelto@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/miscweb: apply [15:30:49] (03PS2) 10Neriah: wikinews: Remove unnecessary settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) [15:31:04] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1286405 (https://phabricator.wikimedia.org/T426083) [15:31:10] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286406 (https://phabricator.wikimedia.org/T426083) [15:31:13] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:31:48] (03CR) 10Neriah: wikinews: Remove unnecessary settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:31:51] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1286407 (https://phabricator.wikimedia.org/T426084) [15:31:57] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286408 (https://phabricator.wikimedia.org/T426084) [15:32:15] (03PS2) 10Alex.sanford: Enforce 2FA requirements for phase 2 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) [15:32:33] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1286409 (https://phabricator.wikimedia.org/T426086) [15:32:39] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) [15:33:20] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1286412 (https://phabricator.wikimedia.org/T426087) [15:33:24] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:33:25] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286413 (https://phabricator.wikimedia.org/T426087) [15:33:45] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:34:34] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1286416 (https://phabricator.wikimedia.org/T426088) [15:34:40] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286417 (https://phabricator.wikimedia.org/T426088) [15:34:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11913418 (10VRiley-WMF) [15:34:50] !log helm uninstall -n miscweb design-strategy - T329991 [15:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:53] T329991: Upgrade Design/Strategy site to use Vitepress and Codex - https://phabricator.wikimedia.org/T329991 [15:35:00] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:35:42] (03CR) 10Ladsgroup: [C:03+1] wikinews: Remove unnecessary settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:37:02] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:37:11] (03CR) 10Mszwarc: [C:03+1] Enforce 2FA requirements for phase 2 groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [15:38:08] btullis@cumin1003 reimage (PID 3096595) is awaiting input [15:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:41:06] vriley@cumin1003 provision (PID 3128100) is awaiting input [15:41:21] jouncebot: nowandnext [15:41:21] For the next 0 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1500) [15:41:21] In 0 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1600) [15:42:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1272.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:42:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:44:49] (03Merged) 10jenkins-bot: wikinews: Remove unnecessary settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286384 (https://phabricator.wikimedia.org/T421796) (owner: 10Neriah) [15:44:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:45:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1286384|wikinews: Remove unnecessary settings (T421796)]] [15:45:20] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:47:17] !log ladsgroup@deploy1003 ladsgroup, neriah: Backport for [[gerrit:1286384|wikinews: Remove unnecessary settings (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [15:48:28] !log ladsgroup@deploy1003 ladsgroup, neriah: Continuing with deployment [15:48:49] (03PS1) 10Jdlrobson: Special:Preferences: Display three options for thumbsizes [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286421 (https://phabricator.wikimedia.org/T424910) [15:49:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:52:30] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [15:52:38] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [15:52:38] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286384|wikinews: Remove unnecessary settings (T421796)]] (duration: 07m 22s) [15:52:43] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:53:06] (03PS5) 10Nvdtn19: viwikivoyage: enable relatedarticle and pop-up Bug: T405724 Change-Id: I93fb76ed14880bd5b7a7fe25bd64fe5d86ed063d [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) [15:57:49] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:00:05] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:23] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:03:07] (03CR) 10Nvdtn19: "Do I need to rebase this patch and how do I do that? 
There are many conflicts." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [16:08:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:10] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1193 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1286427 (https://phabricator.wikimedia.org/T426095) [16:10:16] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286428 (https://phabricator.wikimedia.org/T426095) [16:19:36] 06SRE, 06Content-Transform-Team, 06Wikipedia-Android-App-Backlog: Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11913724 (10Seddon) [16:19:42] (03PS5) 10Ecarg: Wikifunctions: add helm values for function-evaluator in Rust [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) [16:20:13] 06SRE, 06Content-Transform-Team, 06Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11913733 (10cooltey) [16:20:59] 06SRE, 06Content-Transform-Team, 06Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - 
https://phabricator.wikimedia.org/T425545#11913735 (10Seddon) @CDanis / @Jgiannelos [16:22:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2142.codfw.wmnet - https://phabricator.wikimedia.org/T424038#11913741 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:24:34] (03CR) 10Ecarg: Wikifunctions: add helm values for function-evaluator in Rust (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [16:25:25] !log installing Exim security updates on lists/vrts hosts [16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:25] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11913799 (10CDanis) >>! In T425545#11900651, @cooltey wrot... [16:28:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11913802 (10Kappakayala) Approving access for @CWilliams-WMF. Please let me know if there is anything else that is needed from my end in this regard. 
[16:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:46:22] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1270.eqiad.wmnet with OS bookworm [16:46:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11913910 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm executed with errors: - db1270 (**F... [16:54:32] (03PS1) 10Cwhite: grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 [16:54:56] (03CR) 10Ssingh: [C:03+1] "Let us know when we (Traffic) should roll this out?" [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [16:54:57] (03PS2) 10Cwhite: grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 [16:55:02] (03CR) 10CI reject: [V:04-1] grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 (owner: 10Cwhite) [16:55:30] (03CR) 10CI reject: [V:04-1] grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 (owner: 10Cwhite) [16:57:22] (03PS3) 10Cwhite: grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 [16:58:34] (03CR) 10Cwhite: [C:03+2] grafana: swap append() on set for add() [puppet] - 10https://gerrit.wikimedia.org/r/1286433 (owner: 10Cwhite) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1700) [17:05:07] (03CR) 10Dduvall: "We're not currently doing any UID remapping, so `root:root` 
ownership would make more sense." [puppet] - 10https://gerrit.wikimedia.org/r/1285923 (owner: 10Dduvall) [17:15:20] (03PS1) 10Ottomata: EventStreamConfig - ingest mediawiki.user_change into the Data Lake [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286434 (https://phabricator.wikimedia.org/T423952) [17:17:09] (03CR) 10Volans: [C:03+2] wmcs: NFS tracing, skip non existend home/project [puppet] - 10https://gerrit.wikimedia.org/r/1286396 (owner: 10Volans) [17:18:02] (03PS2) 10Volans: wmcs: NFS tracing, skip non existent home/project [puppet] - 10https://gerrit.wikimedia.org/r/1286396 [17:20:43] (03CR) 10FNegri: wmcs: NFS tracing, skip non existent home/project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286396 (owner: 10Volans) [17:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 3h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [17:28:13] (03PS3) 10Volans: wmcs: NFS tracing, skip non existent home/project [puppet] - 10https://gerrit.wikimedia.org/r/1286396 [17:28:57] (03CR) 10Volans: "added a log line to easily check it works fine" [puppet] - 10https://gerrit.wikimedia.org/r/1286396 (owner: 10Volans) [17:29:52] brett@cumin2002 reimage (PID 361025) is awaiting input [17:30:04] you're not my supervisor [17:30:14] (03PS4) 10Volans: wmcs: NFS tracing, skip non existent home/project [puppet] - 10https://gerrit.wikimedia.org/r/1286396 [17:34:46] (03CR) 10TChin: [C:03+1] EventStreamConfig - ingest mediawiki.user_change into the Data Lake [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286434 (https://phabricator.wikimedia.org/T423952) (owner: 10Ottomata) [17:35:31] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub: apply [17:36:03] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub: apply 
[17:36:32] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:36:50] (CR) MVernon: [C:+1] "Please go ahead." [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: Neriah)
[17:37:21] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:37:22] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:37:39] !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:37:52] (CR) TrainBranchBot: [C:+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286434 (https://phabricator.wikimedia.org/T423952) (owner: Ottomata)
[17:38:12] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:38:49] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:38:52] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:38:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:38:56] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:39:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDo
[17:39:52] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:39:54] !log jgiannelos@deploy1003 helmfile [eqiad] DONE
helmfile.d/services/mw-parsoid: apply
[17:39:55] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:39:59] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:40:10] (Merged) jenkins-bot: EventStreamConfig - ingest mediawiki.user_change into the Data Lake [mediawiki-config] - https://gerrit.wikimedia.org/r/1286434 (https://phabricator.wikimedia.org/T423952) (owner: Ottomata)
[17:40:36] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1286434|EventStreamConfig - ingest mediawiki.user_change into the Data Lake (T423952)]]
[17:40:39] T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952
[17:42:31] !log otto@deploy1003 otto: Backport for [[gerrit:1286434|EventStreamConfig - ingest mediawiki.user_change into the Data Lake (T423952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:43:39] brett@cumin2002 provision (PID 368874) is awaiting input
[17:44:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:45:39] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:45:45] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:46:03] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:46:04] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:46:33] !log jgiannelos@deploy1003
helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:47:45] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11914178 (BCornwall) I took a quick look and see that @VRiley-WMF seems to have run: ` $ sudo secure-cookbook sre.hosts.provision lvs1017 --no-user --no-dhcp ` That command shoul...
[17:48:30] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:48:33] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:48:34] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:48:37] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:48:39] (CR) FNegri: [C:+1] wmcs: NFS tracing, skip non existent home/project [puppet] - https://gerrit.wikimedia.org/r/1286396 (owner: Volans)
[17:48:53] brett@cumin2002 provision (PID 368874) is awaiting input
[17:50:40] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:50:43] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:50:44] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:50:47] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:51:56] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:51:58] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:52:00] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:52:03] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:52:31] !log otto@deploy1003 otto: Continuing with deployment
[17:53:37] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:53:41] !log jgiannelos@deploy1003 helmfile
[eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:53:42] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:53:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:53:47] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:54:20] brett@cumin2002 provision (PID 368874) is awaiting input
[17:56:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:56:44] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286434|EventStreamConfig - ingest mediawiki.user_change into the Data Lake (T423952)]] (duration: 16m 08s)
[17:56:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[17:56:47] T423952: Create mediawiki.user_change event stream - https://phabricator.wikimedia.org/T423952
[17:57:29] (PS1) Jforrester: mathoid: Upgrade image to 2026-05-12-175031 with Node 24 [deployment-charts] - https://gerrit.wikimedia.org/r/1286448 (https://phabricator.wikimedia.org/T364779)
[17:58:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:00:05] andre and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot).
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T1800). nyaa~
[18:04:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:05:49] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11914273 (Jclark-ctr) >>! In T421421#11914178, @BCornwall wrote: > I took a quick look and see that @VRiley-WMF seems to have run: > > ` > $ sudo secure-cookbook sre.hosts.provisi...
[18:06:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:06:55] SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11914279 (Dzahn)
[18:07:01] (CR) Volans: [C:+2] wmcs: NFS tracing, skip non existent home/project [puppet] - https://gerrit.wikimedia.org/r/1286396 (owner: Volans)
[18:07:56] SRE, SRE-Access-Requests, Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11914287 (Dzahn)
[18:08:29] ops-eqiad, SRE, DC-Ops, Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11914303 (BCornwall) @Jclark-ctr Might wanna update https://wikitech.wikimedia.org/wiki/UEFI_Boot#Reconfigure_the_server_to_boot_via_UEFI then :)
[18:08:34] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS bullseye
[18:08:51] !log brett@cumin2002 START
- Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:08:51] SRE, Infrastructure-Foundations, Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937#11914311 (Xaosflux) Open→Resolved a:Xaosflux I'm going to mark this closed as there are no recent issues, I've personally tested multiple email workflows successful to...
[18:10:14] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#11914322 (YLiou_WMF) Resolved→Open
[18:10:45] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:13:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:13:55] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#11914341 (YLiou_WMF) Open→Resolved
[18:15:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:17:40] (CR) Ssingh: [C:+1] "@bcornwall@wikimedia.org:
can you please run VCL tests and merge this tomorrow? Thanks." [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: Neriah)
[18:20:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:20:45] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:24:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:25:11] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[18:25:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:25:45] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:33:27] (CR) BCornwall: "Looks like it's failing two tests: 18-thumb-bad-extension.vtc and 20-content-type-fixup.vtc.
Both with:" [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: Neriah)
[18:33:37] (CR) BCornwall: [V:-1] upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: Neriah)
[18:35:18] !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:36:27] (PS1) Jdlrobson: Disable interactions until load is complete [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286456 (https://phabricator.wikimedia.org/T422968)
[18:36:41] (PS2) Dduvall: zuul: Set mode of SSH private key to 0400 [puppet] - https://gerrit.wikimedia.org/r/1285923
[18:42:11] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:56:01] SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11914468 (Dzahn) @KartikMistry Understood. Yea, that also works! **Key verified**. Patch uploaded!
[18:56:39] SRE, SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11914471 (Dzahn) a:KartikMistry→None
[18:57:09] (PS1) Dzahn: admin: update SSH key for Kartik [puppet] - https://gerrit.wikimedia.org/r/1286461 (https://phabricator.wikimedia.org/T425853)
[19:00:35] ops-codfw, SRE, DC-Ops, Traffic: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11914494 (ssingh) Open→Resolved a:ssingh Things look good and lvs2012 is happily serving traffic. Marking as resolved, thanks @Jhancock.wm!
[19:01:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:01:54] (CR) Dzahn: [V:+1] "https://office.wikimedia.org/w/index.php?title=User%3AKMistry_%28WMF%29&diff=376855&oldid=327625" [puppet] - https://gerrit.wikimedia.org/r/1286461 (https://phabricator.wikimedia.org/T425853) (owner: Dzahn)
[19:05:39] !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:06:21] !log migrate link from cr1-magru to asw1-b3-magru to L2 trunk on the switch side T424611
[19:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:24] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[19:06:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:07:44] (PS1) Dduvall: zuul: Run zuul-scheduler/-launcher/-web as zuul user [puppet] - https://gerrit.wikimedia.org/r/1286463
[19:07:49] SRE, Wikimedia-Mailing-lists: Create mailing list for ukwiki arbcom - https://phabricator.wikimedia.org/T426108#11914516 (Ladsgroup) We need two mailman owners. Feel free to send me the other one in private.
[19:10:52] (PS1) Jforrester: Fix MediaHandler caching to not preserve language [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286464 (https://phabricator.wikimedia.org/T425988)
[19:11:07] (PS1) Jforrester: Fix MediaHandler caching to not preserve language [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1286465 (https://phabricator.wikimedia.org/T425988)
[19:14:37] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye
[19:15:56] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286464 (https://phabricator.wikimedia.org/T425988) (owner: Jforrester)
[19:18:21] SRE, Wikimedia-Mailing-lists: Create mailing list for ukwiki arbcom - https://phabricator.wikimedia.org/T426108#11914540 (VadymTS1) @Ladsgroup The second owner email: repakrporget@gmail.com
[19:18:26] (CR) Dduvall: [C:-1] "We need a fixed/reserved uid for the zuul user first, and we might want to add the same user/group with matching uid/gid to the images as " [puppet] - https://gerrit.wikimedia.org/r/1286463 (owner: Dduvall)
[19:21:35] (CR) CI reject: [V:-1] Fix MediaHandler caching to not preserve language [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1286465 (https://phabricator.wikimedia.org/T425988) (owner: Jforrester)
[19:22:23] (CR) Eric Gardner: [C:+1] Disable interactions until load is complete [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286456 (https://phabricator.wikimedia.org/T422968) (owner: Jdlrobson)
[19:23:29] (CR) A smart kitten: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[19:24:19] (CR) CI reject: [V:-1] change logo at zh-classical wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284
(https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[19:25:36] (PS1) Alex.sanford: Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1286469 (https://phabricator.wikimedia.org/T423119)
[19:25:58] !log migrate link from cr2-magru to asw1-b3-magru to L2 trunk on the switch side T424611
[19:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:02] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[19:27:12] (Merged) jenkins-bot: Fix MediaHandler caching to not preserve language [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286464 (https://phabricator.wikimedia.org/T425988) (owner: Jforrester)
[19:27:38] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1286464|Fix MediaHandler caching to not preserve language (T425988 T425740 T425782)]]
[19:27:44] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988
[19:27:45] T425740: File:[filename] throws "RuntimeException: Need to set language before accessing." - https://phabricator.wikimedia.org/T425740
[19:27:45] T425782: [Core][BUG] Need to set language before accessing - https://phabricator.wikimedia.org/T425782
[19:28:02] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[19:30:02] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[19:30:02] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
https://wikitech.wikimedia.org/wiki/Varnish
[19:30:04] !log dancy@deploy1003 jforrester, dancy: Backport for [[gerrit:1286464|Fix MediaHandler caching to not preserve language (T425988 T425740 T425782)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:30:30] !log dancy@deploy1003 jforrester, dancy: Continuing with deployment
[19:30:32] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[19:31:19] (CR) Mszwarc: [C:+1] Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1286469 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[19:34:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286469 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[19:34:45] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286464|Fix MediaHandler caching to not preserve language (T425988 T425740 T425782)]] (duration: 07m 07s)
[19:34:51] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988
[19:34:52] T425740: File:[filename] throws "RuntimeException: Need to set language before accessing."
- https://phabricator.wikimedia.org/T425740
[19:34:52] T425782: [Core][BUG] Need to set language before accessing - https://phabricator.wikimedia.org/T425782
[19:35:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[19:37:20] (CR) Neriah: "recheck" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1286465 (https://phabricator.wikimedia.org/T425988) (owner: Jforrester)
[19:39:36] (CR) BCornwall: [C:+1] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1286417 (https://phabricator.wikimedia.org/T426088) (owner: Gerrit maintenance bot)
[19:40:10] (CR) BCornwall: [C:+1] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1276878 (https://phabricator.wikimedia.org/T424315) (owner: Gerrit maintenance bot)
[19:40:43] (CR) BCornwall: [C:+1] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1286428 (https://phabricator.wikimedia.org/T426095) (owner: Gerrit maintenance bot)
[19:41:00] (CR) BCornwall: [C:+1] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1286406 (https://phabricator.wikimedia.org/T426083) (owner: Gerrit maintenance bot)
[19:41:14] (CR) BCornwall: [C:+1] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1286408 (https://phabricator.wikimedia.org/T426084) (owner: Gerrit maintenance bot)
[19:41:23] (CR) BCornwall: [C:+1] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[19:41:32] (CR) BCornwall: [C:+1] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1286413 (https://phabricator.wikimedia.org/T426087) (owner: Gerrit maintenance bot)
[19:41:44] (CR) BCornwall: [C:+1] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1284560
(https://phabricator.wikimedia.org/T425622) (owner: Gerrit maintenance bot)
[19:42:44] (CR) BCornwall: [C:+1] Add Kubernetes POD IP reverse range delegations for wikikube-ctrl2006 [dns] - https://gerrit.wikimedia.org/r/1285465 (https://phabricator.wikimedia.org/T406596) (owner: Jasmine)
[19:43:07] (CR) BCornwall: [C:+1] wmnet: add wikikube-ctrl2006 to etcd-server SRV record [dns] - https://gerrit.wikimedia.org/r/1249423 (https://phabricator.wikimedia.org/T406596) (owner: Jasmine)
[19:43:20] !log migrate link from cr1-magru to asw1-b4-magru to L2 trunk on the switch side T424611
[19:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:23] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[19:43:25] (CR) BCornwall: [C:+1] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1277601 (https://phabricator.wikimedia.org/T424551) (owner: Gerrit maintenance bot)
[19:51:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bullseye
[19:52:32] !log migrate link from cr2-magru to asw1-b4-magru to L2 trunk on the switch side T424611
[19:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:36] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[19:54:16] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:55:57] (PS5) Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805)
[19:56:30] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:58:40] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with
Dell SCP reboot policy FORCED
[19:59:30] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:00:03] cmooney@cumin1003 netbox (PID 3309795) is awaiting input
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T2000). Please do the needful.
[20:00:05] alexsanford, dbrant, Neriah, and VadymTS1: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] hi
[20:00:11] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:00:14] Hey
[20:00:17] o/
[20:00:24] hi
[20:01:22] I'll go ahead with my two config changes, if there are no objections?
[20:01:33] go for it
[20:01:53] okay
[20:02:02] (CR) TrainBranchBot: [C:+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[20:02:02] (CR) TrainBranchBot: [C:+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1286469 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[20:03:02] (Merged) jenkins-bot: Enforce 2FA requirements for phase 2 groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1285905 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[20:03:06] (Merged) jenkins-bot: Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1286469 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[20:03:35] !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1285905|Enforce 2FA
requirements for phase 2 groups (T423119)]], [[gerrit:1286469|Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 (T423119 T423120)]]
[20:03:40] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119
[20:03:40] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120
[20:05:19] !log migrate link from cr1-esams to asw1-bw27-esams to L2 trunk on the switch side T424611
[20:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:23] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611
[20:05:31] !log alexsanford@deploy1003 alexsanford: Backport for [[gerrit:1285905|Enforce 2FA requirements for phase 2 groups (T423119)]], [[gerrit:1286469|Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 (T423119 T423120)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:06:14] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a3 [vendor] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751)
[20:06:39] (PS1) C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a3 [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981)
[20:06:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: C.
Scott Ananian)
[20:08:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1270.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:10:22] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1270.eqiad.wmnet with OS bookworm
[20:10:31] (PS1) Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036)
[20:10:33] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914707 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm
[20:11:09] o/ (i'm a late addition)
[20:11:09] !log alexsanford@deploy1003 alexsanford: Continuing with deployment
[20:12:20] cscott: your change needs to be rechecked
[20:13:43] (CR) CI reject: [V:-1] Bump wikimedia/parsoid to 0.24.0-a3 [core] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: C. Scott Ananian)
[20:13:50] (CR) Subramanya Sastry: [C:+1] Bump wikimedia/parsoid to 0.24.0-a3 [vendor] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: C. Scott Ananian)
[20:13:55] (PS1) C.
Scott Ananian: Disable unit tests that fail with new vendor release [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286488 [20:13:59] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.24.0-a3 [vendor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: 10C. Scott Ananian) [20:14:05] (03CR) 10Zuul test: "This change depends on a change that failed to merge." [vendor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: 10C. Scott Ananian) [20:14:06] (03CR) 10Zuul test: "This change depends on a change that failed to merge." [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: 10C. Scott Ananian) [20:14:11] (03CR) 10Subramanya Sastry: [C:03+1] Bump wikimedia/parsoid to 0.24.0-a3 [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: 10C. Scott Ananian) [20:14:22] (03PS1) 10C. Scott Ananian: Skip ContentHolderTest that fails with new vendor release [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286489 [20:14:25] (03PS2) 10Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) [20:14:31] Neriah: thanks [20:14:41] (03CR) 10Subramanya Sastry: [C:03+1] Disable unit tests that fail with new vendor release [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286488 (owner: 10C. Scott Ananian) [20:15:00] (03CR) 10C. Scott Ananian: "recheck" [vendor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: 10C. 
Scott Ananian) [20:15:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11914724 (10BCornwall) [20:15:22] !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285905|Enforce 2FA requirements for phase 2 groups (T423119)]], [[gerrit:1286469|Prepare $wgOATH2FARequiredGroupRemovalPages for phases 2 and 3 (T423119 T423120)]] (duration: 11m 47s) [20:15:28] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [20:15:28] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120 [20:15:53] Done [20:15:55] (03PS3) 10Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) [20:16:10] doing mine... [20:16:20] !log migrate link from cr2-esams to asw1-bw27-esams to L2 trunk on the switch side T424611 [20:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:23] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [20:16:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [20:17:14] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:17:17] (03PS4) 10Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. 
[dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) [20:17:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286488 (owner: 10C. Scott Ananian) [20:17:54] (03Merged) 10jenkins-bot: docroot: Add "get_login_creds" permission to Android app. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285930 (https://phabricator.wikimedia.org/T426010) (owner: 10Dbrant) [20:17:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286489 (owner: 10C. Scott Ananian) [20:18:05] i think mine can be deployed together with another change [20:18:15] Neriah: it might not be a blocker to it being deployed (i wouldn't be the person to make that decision anyway), but i'm just curious if DreamRimmer gave the okay to their svwiki patch being scheduled/deployed on their behalf? i'm only asking as (at least in my personal experience) it's usually the owner of such a site-request patch that schedules it for deployment / is in #wikimedia-operations at the time of the deployment [20:18:18] !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1285930|docroot: Add "get_login_creds" permission to Android app. (T426010)]] [20:18:22] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [20:19:22] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: 10C. 
Scott Ananian) [20:20:09] (03CR) 10Lerickson: "Hi all, Ben suggested this change today as a way to be able to iterate more quickly while testing that the dump DAG works (to avoid waitin" [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [20:20:15] !log dbrant@deploy1003 dbrant: Backport for [[gerrit:1285930|docroot: Add "get_login_creds" permission to Android app. (T426010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:20:54] !log dbrant@deploy1003 dbrant: Continuing with deployment [20:21:56] (03CR) 10BCornwall: "Thank you for updating the tests!" [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [20:22:11] A_smart_kitten: I remember seeing a few cases where this wasn't the case, so I did it myself. [20:22:23] is there a way to talk to him now? [20:22:43] similar question for VadymTS1 re the RSS cowikimedia patch [20:23:22] Neriah: i wouldn't know to be honest, i'm not in personal contact [20:23:25] !log migrate link from cr1-esams to asw1-by27-esams to L2 trunk on the switch side T424611 [20:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:29] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [20:24:31] Neriah: if you recall a previous case when this sort of thing has happened then i'd be interested in it if you remember e.g. what day/backport window it was at (not in a bad way, to be clear! i'd just be personally interested in reading up the circumstances of previous cases) [20:24:32] A_smart_kitten: I planned this because the change wasn't implemented yesterday. Also, I often see other users trashing other people's changes. 
[20:25:06] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:25:48] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1271.eqiad.wmnet with OS bookworm [20:25:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914752 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1271.eqiad.wmnet with OS bookworm [20:26:45] !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285930|docroot: Add "get_login_creds" permission to Android app. (T426010)]] (duration: 08m 27s) [20:26:48] T426010: Enable integration with Credential Manager - https://phabricator.wikimedia.org/T426010 [20:26:59] VadymTS1: i would welcome someone more experienced than me to confirm/deny my thoughts here; but at least in my personal experience, it seems like it's normally a site-request patch's author that schedules & is around for deployment. i suppose maybe at least in part because they may have additional knowledge/context that may be relevant in the deployment of a given patch. [but again, i welcome others confirming/denying my t [20:27:08] ^my thoughts here :) ] [20:27:08] done! [20:27:28] I'm looking over 1283048 and 1285482 - I'm a little confused as to why folks have scheduled others' patches without checking with the author [20:27:40] TheresNoTime: agreed (see above ^^) [20:28:03] A_smart_kitten: there's a limit to what you can ask of me :D [20:28:13] fair enough neriah :) no worries [20:28:23] (03CR) 10Jack who built the house: Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [20:28:26] is VadymTS1 new-ish to the deploy process? 
it might be an issue of unclear expectations [20:28:33] I clearly remember a case like that. It could be something unusual - I'm not familiar enough with it. But the case you're talking about is definitely the most common one. [20:29:10] VadymTS1: (also, i'm not quite sure what you mean by ' other users trashing other people's changes'? apologies if it's something obvious i'm not understanding :) ) [20:29:52] (maybe by "trash" you mean "deploy"?) [20:30:08] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.30 ms [20:30:19] I accidentally wrote that, I meant deploy [20:30:48] FWIW CTT will often have "some content transform team" member deploy a patch on behalf of the team, it doesn't always have to be the author because we're usually familiar with the patch (and it could be the code reviewer, etc) [20:31:05] [ahh, thank you for the clarification :) ] [20:33:30] (03CR) 10Dzahn: "i can take care of getting a reserved user" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [20:34:06] ya, i meant deploy. i started writing something and changed it, and somehow forgot to change that word [20:34:21] english is not my native language... 🤦‍♂️ [20:34:39] i suppose the process in practise may be different for 'patches made by a volunteer in response to a wikimedia-site-requests-task', & for some other patches that get uploaded for the mediawiki-config repo. e.g. for patches uploaded by a member of a WMF team, it might be common that another member of that same team deploys the patch (as in csott's example). 
[20:34:47] but at least in my personal experience, for wikimedia-site-requests patches uploaded by a volunteer, it's _generally_ that volunteer that schedules it for deployment & is around in #wikimedia-operations [20:35:26] !log migrate link from cr2-esams to asw1-by27-esams to L2 trunk on the switch side T424611 [20:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:30] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [20:35:45] To keep things moving, if Neriah and VadymTS1 are familiar enough with the changes and are confident they can test them, then its probably okay to continue (cscott were you running the window, I missed the deployers ping but can take over if needed) [20:36:11] (CTT has sometimes taken ownership of a site config patch on behalf of a community, for what it's worth. Most recently we've been turning off the magic links features on a number of wikis, and i've handled deploying those patches once I've confirmed that the appropriate community consultation has occurred etc.) [20:36:29] (I was last to go, so whoever wants to proceed) [20:36:34] i wasn't running the window, i'm just an enthusiastic participant. :) [20:37:15] [re my prev message] exceptions do exist; e.g. I scheduled and was around for the deployment of https://gerrit.wikimedia.org/r/1192528, which wasn't my patch. but in that case it had been asked/arranged beforehand at https://phabricator.wikimedia.org/T406023#11417321 [20:37:30] my patches are just finishing up the 'check' in jenkins, so there's time for another spiderpig run before i jump in. [20:37:55] ack, Neriah are you ready for 1285482 ? [20:39:30] A_smart_kitten: I don't mind giving up my deploy, I don't really care about it that much. My general approach is just, "if I can do something myself, why should someone else have to do it?"... :) [20:39:30] Anyway, TheresNoTime, I'm familiar with the change and I can do testing. 
[20:39:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285482 (https://phabricator.wikimedia.org/T425806) (owner: 10Dreamrimmer) [20:40:45] (03Merged) 10jenkins-bot: Allow svwiki bureaucrats to remove sysop rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285482 (https://phabricator.wikimedia.org/T425806) (owner: 10Dreamrimmer) [20:41:10] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1285482|Allow svwiki bureaucrats to remove sysop rights (T425806)]] [20:41:14] T425806: Allow svwiki bureaucrats to remove sysop rights - https://phabricator.wikimedia.org/T425806 [20:41:25] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [20:41:29] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [20:41:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [20:41:39] Neriah: if TheresNoTime is okay with it then that's okay with me :) but only speaking personally, in general i would advise e.g. asking/checking with a site-request patch-author before scheduling a patch/patches on their behalf in future (if only to avoid them being surprised that it's been deployed without them having gone through the deployment process themselves) [20:42:03] cc VadymTS1 ^^ [20:42:04] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1270.eqiad.wmnet with reason: host reimage [20:42:56] yeah, I'll do it in the future [20:43:10] !log samtar@deploy1003 samtar, dreamrimmer: Backport for [[gerrit:1285482|Allow svwiki bureaucrats to remove sysop rights (T425806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:43:12] I still have so much to learn./. 
[20:43:12] ok [20:43:13] (03CR) 10Dzahn: "turns out we already have UID 923 reserved for zuul:" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [20:43:30] Neriah: change is on mwdebug for testing, let me know if its okay to continue :) [20:44:48] !log migrate link from cr1-drmrs to asw1-b12-drmrs to L2 trunk on the switch side T424611 [20:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:52] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [20:45:37] TheresNoTime: Looks fine [20:46:05] !log samtar@deploy1003 samtar, dreamrimmer: Continuing with deployment [20:46:50] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [20:46:59] Neriah: indeed (i also wouldn't be surprised if these sorts of norms might not be very well documented, so in a way maybe they should be better documented as well if so :) and i sympathise with there being a lot to learn... 
) [20:47:12] (my patches are all now green, so i'm good to go) [20:47:28] (03PS1) 10Cathal Mooney: Add INCLUDEs for new IPs allocated for IBGP peering at POPs [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) [20:47:50] cscott: will let you know when this is deployed [20:48:24] (03CR) 10CI reject: [V:04-1] Add INCLUDEs for new IPs allocated for IBGP peering at POPs [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [20:48:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1270.eqiad.wmnet with reason: host reimage [20:49:55] cmooney@cumin1003 netbox (PID 3314375) is awaiting input [20:50:14] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1285482|Allow svwiki bureaucrats to remove sysop rights (T425806)]] (duration: 09m 03s) [20:50:17] T425806: Allow svwiki bureaucrats to remove sysop rights - https://phabricator.wikimedia.org/T425806 [20:50:50] cscott: do you want to do your patches now? [20:50:59] yes [20:51:03] Neriah: deployed :) [20:51:04] i can spiderpig [20:51:09] ack, all yours [20:51:10] thank you :)! [20:51:58] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1271.eqiad.wmnet with OS bookworm [20:52:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1271.eqiad.wmnet with OS bookworm executed with errors: - db1271 (**F... [20:52:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: 10C. 
Scott Ananian) [20:52:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: 10C. Scott Ananian) [20:52:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286488 (owner: 10C. Scott Ananian) [20:52:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286489 (owner: 10C. Scott Ananian) [20:52:46] Then I'm after cscott, right? [20:53:39] (03Merged) 10jenkins-bot: Disable unit tests that fail with new vendor release [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286488 (owner: 10C. Scott Ananian) [20:53:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11914811 (10Dzahn) 05Stalled→03In progress [20:54:28] !log migrate link from cr2-drmrs to asw1-b12-drmrs to L2 trunk on the switch side T424611 [20:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:31] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260512T2100) [21:01:16] I'm going to have to step away for now, another deployer may be around to do the 2 remaining patches [21:01:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [21:01:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[21:01:47] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [21:02:31] (03Merged) 10jenkins-bot: Skip ContentHolderTest that fails with new vendor release [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286489 (owner: 10C. Scott Ananian) [21:03:04] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:03:20] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:03:47] !log migrate link from cr1-drmrs to asw1-b13-drmrs to L2 trunk on the switch side T424611 [21:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:51] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [21:03:53] ill be doing a deploy shortly [21:04:20] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:04:25] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a3 [vendor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286484 (https://phabricator.wikimedia.org/T409751) (owner: 10C. Scott Ananian) [21:04:32] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a3 [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286485 (https://phabricator.wikimedia.org/T425981) (owner: 10C. 
Scott Ananian) [21:04:34] Jdlrobson: FYI I think scott is currently deploying [21:04:39] cscott * [21:04:52] np i can wait [21:05:01] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1286484|Bump wikimedia/parsoid to 0.24.0-a3 (T409751 T420336 T425981)]], [[gerrit:1286485|Bump wikimedia/parsoid to 0.24.0-a3 (T425981)]], [[gerrit:1286488|Disable unit tests that fail with new vendor release]], [[gerrit:1286489|Skip ContentHolderTest that fails with new vendor release]] [21:05:03] I am also left. [21:05:08] T409751: Lazy loading of data-mw and data-parsoid - https://phabricator.wikimedia.org/T409751 [21:05:09] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [21:05:09] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [21:05:09] T425981: CTT tasks week of 2026-05-08 - https://phabricator.wikimedia.org/T425981 [21:05:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [21:05:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:28] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:05:29] (03PS2) 10Cathal Mooney: Add INCLUDEs for new IPs allocated for IBGP peering at POPs [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) [21:05:47] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:05:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
db1270.eqiad.wmnet with OS bookworm [21:05:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1270.eqiad.wmnet with OS bookworm completed: - db1270 (**PASS**) -... [21:06:27] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1271.eqiad.wmnet with OS bookworm [21:06:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1271.eqiad.wmnet with OS bookworm [21:06:56] !log cscott@deploy1003 cscott: Backport for [[gerrit:1286484|Bump wikimedia/parsoid to 0.24.0-a3 (T409751 T420336 T425981)]], [[gerrit:1286485|Bump wikimedia/parsoid to 0.24.0-a3 (T425981)]], [[gerrit:1286488|Disable unit tests that fail with new vendor release]], [[gerrit:1286489|Skip ContentHolderTest that fails with new vendor release]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:07] (03PS1) 10Bking: opensearch on k8s: Enable service mesh for clusters [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) [21:09:42] PROBLEM - Host asw1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [21:09:42] PROBLEM - Host tcp-proxy6002 is DOWN: PING CRITICAL - Packet loss = 100% [21:09:44] PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100% [21:09:56] RECOVERY - Host asw1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 92.63 ms [21:09:58] RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 88.67 ms [21:10:04] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 1h, 42 minutes.
https://wikitech.wikimedia.org/wiki/Varnish [21:10:38] RECOVERY - Host tcp-proxy6002 is UP: PING OK - Packet loss = 0%, RTA = 87.75 ms [21:11:10] cscott: are you done deploying? [21:11:22] testing the deploy atm [21:13:19] ^^ above drmrs alerts were due to an error on my part, briefly caused disruption on traffic flowing to asw1-b13 in drmrs [21:13:27] recoveries should be incoming [21:13:48] (03PS2) 10CDanis: opensearch on k8s: Enable service mesh for clusters [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [21:13:50] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [21:15:34] !log migrate link from cr1-drmrs to asw1-b13-drmrs to L2 trunk on the switch side T424611 [21:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:37] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [21:15:44] !log cscott@deploy1003 cscott: Continuing with deployment [21:16:02] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [21:16:15] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [21:17:09] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [21:17:52] (03PS1) 10C. 
Scott Ananian: Re-enable unit tests with updated output [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286506 [21:19:53] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286484|Bump wikimedia/parsoid to 0.24.0-a3 (T409751 T420336 T425981)]], [[gerrit:1286485|Bump wikimedia/parsoid to 0.24.0-a3 (T425981)]], [[gerrit:1286488|Disable unit tests that fail with new vendor release]], [[gerrit:1286489|Skip ContentHolderTest that fails with new vendor release]] (duration: 14m 51s) [21:19:59] T409751: Lazy loading of data-mw and data-parsoid - https://phabricator.wikimedia.org/T409751 [21:19:59] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [21:20:00] T425981: CTT tasks week of 2026-05-08 - https://phabricator.wikimedia.org/T425981 [21:20:20] Can someone deploy my changes? [21:20:30] (03CR) 10BCornwall: [C:03+1] Add INCLUDEs for new IPs allocated for IBGP peering at POPs (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [21:21:05] ok i'm done. Jdlrobson we still have a few patches left over from the window (VadymTS1's) [21:21:55] i can deploy those for you VadymTS1 if Jdlrobson is willing to wait and VadymTS1 is able to test them once deployed. 
[21:23:10] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [21:23:38] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [21:25:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: 10Danielyepezgarces) [21:25:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286390 (https://phabricator.wikimedia.org/T425332) (owner: 10VadymTS1) [21:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 7h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [21:25:24] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for new IPs allocated for IBGP peering at POPs [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [21:26:22] (03Merged) 10jenkins-bot: Enabling RSS extension for cowikimedia chapter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: 10Danielyepezgarces) [21:26:24] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for new IPs allocated for IBGP peering at POPs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1286501 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [21:26:26] (03Merged) 10jenkins-bot: Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286390 (https://phabricator.wikimedia.org/T425332) (owner: 10VadymTS1) [21:26:28] cscott: i need to get these patches out [21:26:34] Am I allowed to overrun the window? 
[21:26:53] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1283048|Enabling RSS extension for cowikimedia chapter (T425440)]], [[gerrit:1286390|Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary (T425332)]] [21:26:58] T425440: Enable RSS extension for cowikimedia - https://phabricator.wikimedia.org/T425440 [21:26:59] T425332: Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary - https://phabricator.wikimedia.org/T425332 [21:27:22] cscott: need to get a couple of patches out for a deployment next week [21:27:47] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [21:28:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new link networks - cmooney@cumin1003" [21:28:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:21] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:28:51] !log cscott@deploy1003 danielyepezgarces, cscott, vadymts1: Backport for [[gerrit:1283048|Enabling RSS extension for cowikimedia chapter (T425440)]], [[gerrit:1286390|Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary (T425332)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:28:58] checking [21:29:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1271.eqiad.wmnet with reason: host reimage [21:30:22] (03PS6) 10Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) [21:30:58] cscott: looks like your changes have synced to the test server [21:31:07] cscott: Everything is good [21:31:13] !log cscott@deploy1003 danielyepezgarces, cscott, vadymts1: Continuing with deployment [21:32:21] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:32:51] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [21:35:03] 10SRE-SLO, 06SRE Observability, 13Patch-For-Review: Grafana: deploy grafana-dashboard-reporter-app - https://phabricator.wikimedia.org/T425795#11914974 (10herron) [21:35:50] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914976 (10VRiley-WMF) [21:37:48] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11914980 (10RobH) Scheduled a new site visit for them to go out this Friday @ 8AM Singapore Time so my Thursday @ 4PM.
[21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 
2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:03] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:13] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:13] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:13] PROBLEM - check if authdns-update was run after a change was merged 
to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:13] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:15] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:15] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:38:18] cscott: ? [21:38:41] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1273] - vriley@cumin1003" [21:38:41] `Production checks failed. 
View the job log for details.` [21:38:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1273] - vriley@cumin1003" [21:38:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:38:50] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283048|Enabling RSS extension for cowikimedia chapter (T425440)]], [[gerrit:1286390|Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary (T425332)]] (duration: 11m 56s) [21:38:55] T425440: Enable RSS extension for cowikimedia - https://phabricator.wikimedia.org/T425440 [21:38:55] T425332: Set $wgSignatureAllowedLintErrors to an empty array on Spanish Wiktionary - https://phabricator.wikimedia.org/T425332 [21:39:18] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1273 [21:39:52] Jdlrobson: i'm done [21:39:55] sorry about that delay [21:40:06] thanks cscott [21:40:17] cscott: thanks [21:40:39] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1272.eqiad.wmnet with OS bookworm [21:40:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11914989 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1272.eqiad.wmnet with OS bookworm [21:40:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286456 (https://phabricator.wikimedia.org/T422968) (owner: 10Jdlrobson) [21:41:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 2ff7e12578bf35cc5739e7af82073f6c26a067fb, dns.git is 
ba7438b5fc2279f2621b372a80266954127547a1) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:41:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1273 [21:42:31] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1273.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:42:55] (03PS1) 10Dwisehaupt: Move fundraising analytics servers to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1286511 (https://phabricator.wikimedia.org/T364186) [21:43:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:32] (03Merged) 10jenkins-bot: Disable interactions until load is complete [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286456 (https://phabricator.wikimedia.org/T422968) (owner: 10Jdlrobson) [21:43:56] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1286456|Disable interactions until load is complete (T422968 T424787)]] [21:44:01] T422968: Share Highlights: add loading state - https://phabricator.wikimedia.org/T422968 [21:44:01] T424787: Share Highlight: article title should not be bold within card when text is selected - https://phabricator.wikimedia.org/T424787 [21:45:35] there do seem to be a lot of the "accessing the language without explicitly setting it" error messages, but I don't think that's related to the config patches of VadymTS1. Looking into it. [21:46:34] cscott: does that look like (what's written in) T425988? 
if so iiuc it may be known [21:46:34] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988 [21:46:51] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:47:22] T423911 [21:47:23] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911 [21:47:56] (03PS1) 10Jdlrobson: Also merge views overflow into array-items [skins/Vector] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286513 (https://phabricator.wikimedia.org/T426115) [21:48:13] (03PS1) 10Jdlrobson: Also merge views overflow into array-items [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286514 (https://phabricator.wikimedia.org/T426115) [21:48:54] Jdlrobson: after you are done, I might go ahead and backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1286508 to quiet those warnings. [21:49:57] vriley@cumin1003 reimage (PID 3319937) is awaiting input [21:50:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [21:50:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1271.eqiad.wmnet with OS bookworm [21:51:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915022 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1271.eqiad.wmnet with OS bookworm completed: - db1271 (**PASS**) -... [21:54:19] cscott: looks ... stuck? 
[21:54:33] cscott: After this one I have one more to do (a 3 patch deploy) [21:54:59] it's been stuck on `21:44:51 K8s images build/push output redirected to /var/lib/spiderpig/scap-image-build-and-push-log` for 10 minutes [21:55:11] no rush. [21:55:26] i don't think it's stuck, i think it's rebuilding i18n which can take ~50min. [21:55:30] ah ok [21:55:38] (03CR) 10Jgreen: [C:03+2] Move fundraising analytics servers to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1286511 (https://phabricator.wikimedia.org/T364186) (owner: 10Dwisehaupt) [21:55:48] vriley@cumin1003 provision (PID 3325817) is awaiting input [21:55:55] (03CR) 10Jgreen: [C:03+1] Move fundraising analytics servers to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1286511 (https://phabricator.wikimedia.org/T364186) (owner: 10Dwisehaupt) [21:56:02] Jdlrobson: although i'm not confident in that diagnosis, since it does say "0 languages rebuilt" earlier in the log. [21:56:11] yeh.. [21:56:23] oh it moved! [21:57:15] (03CR) 10Dwisehaupt: [C:03+2] Move fundraising analytics servers to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1286511 (https://phabricator.wikimedia.org/T364186) (owner: 10Dwisehaupt) [21:57:19] huh. maybe l10n cache got faster? it does say "549 languages rebuilt out of 549" but only took ~20s to do that [21:57:29] !log dwisehaupt@dns1004 START - running authdns-update [21:57:46] Jdlrobson: anyway, no rush, just ping me when you're done. 
[21:57:47] (03CR) 10CI reject: [V:04-1] Also merge views overflow into array-items [skins/Vector] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286513 (https://phabricator.wikimedia.org/T426115) (owner: 10Jdlrobson) [21:58:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [21:59:05] !log dwisehaupt@dns1004 END - running authdns-update [22:01:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:01:30] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1286456|Disable interactions until load is complete (T422968 T424787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:01:40] T422968: Share Highlights: add loading state - https://phabricator.wikimedia.org/T422968 [22:01:40] T424787: Share Highlight: article title should not be bold within card when text is selected - https://phabricator.wikimedia.org/T424787 [22:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: 
Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:13] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:15] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:15] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [22:03:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1273.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:04:04] 👀 [22:05:06] (03CR) 10CDanis: [V:03+1 C:03+1] "PCC LGTM https://puppet-compiler.wmflabs.org/output/1286504/6721/deploy1003.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [22:05:20] !log 
jdlrobson@deploy1003 jdlrobson: Continuing with deployment [22:07:48] I'm going to continue with that window since it started late. I don't see anything in the calendar so I hope that's okay? [22:09:52] i don't see anything after you, nor anyone from ops waiting [22:10:19] you have my completely non-authoritative permission ;-) [22:11:15] And you have my even less authoritative permission! [22:11:17] 06SRE, 10Wikimedia-Mailing-lists: Create mailing list for ukwiki arbcom - https://phabricator.wikimedia.org/T426108#11915097 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikipedia-uk-arbcom.lists.wikimedia.org [22:11:29] (I accept no responsibility for what happens next) [22:13:39] jouncebot: nowandnext [22:13:39] No deployments scheduled for the next 7 hour(s) and 46 minute(s) [22:13:39] In 7 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0600) [22:13:43] Jdlrobson: yeah go for it :D [22:14:21] that sounded authoritative. ;) [22:15:06] (03PS1) 10C. Scott Ananian: Revert "Remove File::getHandler language fallback" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) [22:16:01] haha thanks cdanis [22:16:11] cdanis: when jon is done, I'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1286515 to suppress the logspam, and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/1286506 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1286444 to make wmf.2 CI better match the state of master. [22:16:13] dang, that cdanis is going places with that sort of authority... [22:16:56] cscott: that seems fine, so long as you're sticking around for a while afterwards [22:17:08] `Production checks failed.View the job log for details.` [22:17:12] got that again...? 
[22:17:24] i think that's related to the logspam [22:17:25] can ignore failure and continue deployment? [22:17:32] i hit 'r' for retry and it was happy [22:17:58] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286456|Disable interactions until load is complete (T422968 T424787)]] (duration: 34m 01s) [22:18:03] T422968: Share Highlights: add loading state - https://phabricator.wikimedia.org/T422968 [22:18:03] T424787: Share Highlight: article title should not be bold within card when text is selected - https://phabricator.wikimedia.org/T424787 [22:18:29] (03PS1) 10C. Scott Ananian: Re-enable ContentHolderTest with updated output [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286516 [22:18:35] yep that worked [22:18:47] ok hopefully this one goes a lot quicker [22:18:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286514 (https://phabricator.wikimedia.org/T426115) (owner: 10Jdlrobson) [22:18:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286513 (https://phabricator.wikimedia.org/T426115) (owner: 10Jdlrobson) [22:18:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286421 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [22:19:55] (03PS1) 10BCornwall: Revert "site: Set lvs1017 to insetup_noferm" [puppet] - 10https://gerrit.wikimedia.org/r/1286517 [22:20:20] (03CR) 10CI reject: [V:04-1] Revert "site: Set lvs1017 to insetup_noferm" [puppet] - 10https://gerrit.wikimedia.org/r/1286517 (owner: 10BCornwall) [22:20:44] (03Merged) 10jenkins-bot: Also merge views overflow into array-items [skins/Vector] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286514 
(https://phabricator.wikimedia.org/T426115) (owner: 10Jdlrobson) [22:21:23] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1272.eqiad.wmnet with reason: host reimage [22:21:37] (03PS1) 10Jdlrobson: [Share Highlight] Exclude section edit links, footnotes from selection [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286518 (https://phabricator.wikimedia.org/T423658) [22:21:38] (03PS2) 10BCornwall: Revert "site: Set lvs1017 to insetup_noferm" [puppet] - 10https://gerrit.wikimedia.org/r/1286517 [22:21:47] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1272.eqiad.wmnet with reason: host reimage [22:22:34] (03PS3) 10BCornwall: Revert "site: Set lvs1017 to insetup_noferm" [puppet] - 10https://gerrit.wikimedia.org/r/1286517 [22:24:42] (03CR) 10Jack who built the house: Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [22:25:35] (03CR) 10Eric Gardner: [C:03+1] [Share Highlight] Exclude section edit links, footnotes from selection [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286518 (https://phabricator.wikimedia.org/T423658) (owner: 10Jdlrobson) [22:25:45] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:26:34] (03Merged) 10jenkins-bot: Also merge views overflow into array-items [skins/Vector] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286513 
(https://phabricator.wikimedia.org/T426115) (owner: 10Jdlrobson) [22:31:24] (03Merged) 10jenkins-bot: Special:Preferences: Display three options for thumbsizes [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286421 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [22:31:59] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [22:32:00] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1286514|Also merge views overflow into array-items (T426115)]], [[gerrit:1286513|Also merge views overflow into array-items (T426115)]], [[gerrit:1286421|Special:Preferences: Display three options for thumbsizes (T424910)]] [22:32:05] T426115: Collapsed views items missing from "more" menu at narrow widths in Vector 2022 - https://phabricator.wikimedia.org/T426115 [22:32:06] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910 [22:37:42] (03CR) 10Dreamy Jazz: [C:03+2] Show CAPTCHA if required for all edits before first edit attempt (031 comment) [extensions/DiscussionTools] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286324 (https://phabricator.wikimedia.org/T425955) (owner: 10Dreamy Jazz) [22:37:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11915171 (10BCornwall) [22:38:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11915172 (10BCornwall) [22:40:10] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:40:29] !log vriley@cumin1003 END (PASS) 
- Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [22:40:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1272.eqiad.wmnet with OS bookworm [22:40:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915177 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1272.eqiad.wmnet with OS bookworm completed: - db1272 (**WARN**) -... [22:41:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915192 (10VRiley-WMF) [22:42:07] (03PS1) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) [22:42:48] (03PS4) 10BCornwall: Revert "site: Set lvs1017 to insetup_noferm" [puppet] - 10https://gerrit.wikimedia.org/r/1286517 (https://phabricator.wikimedia.org/T421421) [22:42:50] (03PS1) 10BCornwall: Add lvs1017 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1286522 (https://phabricator.wikimedia.org/T421421) [22:42:52] (03PS1) 10BCornwall: Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [22:42:54] (03PS1) 10BCornwall: Remove hieradata/hosts/lvs1016.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [22:43:39] (03CR) 10CI reject: [V:04-1] Add lvs1017 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1286522 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [22:44:11] (03CR) 10CI reject: [V:04-1] haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [22:45:52] 
(03PS2) 10BCornwall: Add lvs1017 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1286522 (https://phabricator.wikimedia.org/T421421) [22:45:53] (03PS2) 10BCornwall: Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [22:45:53] (03PS2) 10BCornwall: Remove hieradata/hosts/lvs1016.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [22:47:35] (03CR) 10Aklapper: "Thanks everyone for looking into this issue! Is this backport ready to get merged into the codebase and deployed? Just asking to avoid pot" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. Scott Ananian) [22:49:20] (03PS2) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) [22:49:37] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1286514|Also merge views overflow into array-items (T426115)]], [[gerrit:1286513|Also merge views overflow into array-items (T426115)]], [[gerrit:1286421|Special:Preferences: Display three options for thumbsizes (T424910)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:49:42] T426115: Collapsed views items missing from "more" menu at narrow widths in Vector 2022 - https://phabricator.wikimedia.org/T426115 [22:49:43] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910 [22:49:53] (03CR) 10CI reject: [V:04-1] haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [22:50:02] (03Abandoned) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [22:51:07] (03PS3) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) [22:51:39] (03CR) 10CI reject: [V:04-1] haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [22:53:39] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [22:53:49] (03PS4) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) [22:54:28] (03CR) 10CI reject: [V:04-1] haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [22:59:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11915233 (10BCornwall) @ssingh The group of patches I've uploaded add an extra step of adding lvs1017 to the end of the list and a lower priority - much like ho... 
[22:59:18] (03PS5) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) [23:01:09] (03PS3) 10BCornwall: Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [23:01:27] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [23:01:36] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1273.eqiad.wmnet with OS bookworm [23:01:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915246 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1273.eqiad.wmnet with OS bookworm [23:05:29] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286514|Also merge views overflow into array-items (T426115)]], [[gerrit:1286513|Also merge views overflow into array-items (T426115)]], [[gerrit:1286421|Special:Preferences: Display three options for thumbsizes (T424910)]] (duration: 33m 28s) [23:05:34] T426115: Collapsed views items missing from "more" menu at narrow widths in Vector 2022 - https://phabricator.wikimedia.org/T426115 [23:05:34] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910 [23:05:48] Jdlrobson: done? [23:10:17] ok, i'm going to kick off my patches (final set of the day, hopefully!) 
[23:10:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286506 (owner: 10C. Scott Ananian) [23:10:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286516 (owner: 10C. Scott Ananian) [23:11:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. Scott Ananian) [23:15:33] (03CR) 10BCornwall: [V:03+2 C:03+1] "Wonderful! Thank you so much for the patch, Neriah. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [23:17:20] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:19:09] (03CR) 10C. Scott Ananian: "I'm planning to backport it before the train rolls." [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. Scott Ananian) [23:19:20] (03PS8) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [23:21:28] (03Merged) 10jenkins-bot: Re-enable unit tests with updated output [extensions/Kartographer] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286506 (owner: 10C. 
Scott Ananian) [23:21:48] (03PS8) 10Dzahn: codesearch: create script/timer to delete zombie lock files [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) [23:21:48] (03CR) 10Dzahn: codesearch: create script/timer to delete zombie lock files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285488 (https://phabricator.wikimedia.org/T421147) (owner: 10Dzahn) [23:22:08] (03Merged) 10jenkins-bot: Re-enable ContentHolderTest with updated output [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286516 (owner: 10C. Scott Ananian) [23:22:13] (03CR) 10CI reject: [V:04-1] Revert "Remove File::getHandler language fallback" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. Scott Ananian) [23:22:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. 
Scott Ananian) [23:23:13] (03CR) 10Dzahn: [C:03+2] zuul: Set mode of SSH private key to 0400 [puppet] - 10https://gerrit.wikimedia.org/r/1285923 (owner: 10Dduvall) [23:23:33] spurious failure on browser tests :( [23:30:30] (03PS1) 10Fabfur: hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) [23:30:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [23:32:20] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [23:33:00] (03Merged) 10jenkins-bot: Revert "Remove File::getHandler language fallback" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286515 (https://phabricator.wikimedia.org/T425988) (owner: 10C. Scott Ananian) [23:33:36] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1286506|Re-enable unit tests with updated output]], [[gerrit:1286516|Re-enable ContentHolderTest with updated output]], [[gerrit:1286515|Revert "Remove File::getHandler language fallback" (T425988)]] [23:33:40] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988 [23:35:50] (03PS9) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [23:39:14] !log cscott@deploy1003 cscott: Backport for [[gerrit:1286506|Re-enable unit tests with updated output]], [[gerrit:1286516|Re-enable ContentHolderTest with updated output]], [[gerrit:1286515|Revert "Remove File::getHandler language fallback" (T425988)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:39:18] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988 [23:39:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1286528 [23:39:56] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1286528 (owner: 10TrainBranchBot) [23:40:08] !log cscott@deploy1003 cscott: Continuing with deployment [23:45:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in magru #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=magru&var-cluster=text&var-origin=performance.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:46:22] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286506|Re-enable unit tests with updated output]], [[gerrit:1286516|Re-enable ContentHolderTest with updated output]], [[gerrit:1286515|Revert "Remove File::getHandler language fallback" (T425988)]] (duration: 12m 45s) [23:46:25] T425988: Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler - https://phabricator.wikimedia.org/T425988 [23:46:50] ok done! [23:46:59] i'll stick around IRC for a while in case anything breaks [23:47:28] cscott: We are getting elevated 500s [23:47:40] not sure if it's related [23:48:14] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1273.eqiad.wmnet with reason: host reimage [23:48:21] fun. 
the "PHP Deprecated: Accessing the language without explicitly setting it via MediaHandler:setLanguage, MediaHandler::getHandler, or MediaHandlerFactory::getHandler was deprecated in 1.46. [Called from MediaWiki\Media\ImageHandler:: " logspam has stopped, though [23:50:38] brett: do we have a logstash entry for the 500? [23:51:36] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1286528 (owner: 10TrainBranchBot) [23:53:20] cscott: no, I had just noticed all dcs starting to climb in 500s (and an ats page) at the same time as the scap backport [23:53:54] i don't see anything unusual in logstash [23:54:16] !incidents [23:54:16] 7926 (UNACKED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru) [23:54:27] ack 7926 [23:54:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1273.eqiad.wmnet with reason: host reimage [23:54:41] !ack 7926 [23:54:42] 7926 (ACKED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru) [23:55:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh