[00:05:32] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#12016833 (10Ladsgroup) [00:08:56] 06SRE, 06ESEAP-Hub, 10Wikimedia-Mailing-lists: Requesting creation of eseap-youth mailing list - https://phabricator.wikimedia.org/T428844#12016835 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/eseap-youth.lists.wikimedia.org/ [00:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:34] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:54:05] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:12:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1301864 [01:12:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1301864 (owner: 10TrainBranchBot) [01:20:04] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1301864 (owner: 10TrainBranchBot) [02:01:07] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:39] FIRING: [3x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:05] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 58s) [02:17:58] !log making Dexbot a bot in cywiki (T428927) [02:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:04] T428927: cywiki bad wikitext - https://phabricator.wikimedia.org/T428927 [02:34:39] FIRING: [3x] JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:45:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is 
not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [03:07:34] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:04:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12016920 (10Papaul) @cmooney I took a look at the steps all look good to me for the RE. However I didn't see the steps to... [04:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:31:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:05] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:25:31] (03CR) 10Marostegui: "Thanks. 
So you meant the previous comment where you said: "The cookbook ran OK in itself but the scripts rejected the dbctl status (I susp" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [05:30:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2046', diff saved to https://phabricator.wikimedia.org/P94104 and previous config saved to /var/cache/conftool/dbconfig/20260615-053041-marostegui.json [05:31:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2045.codfw.wmnet with reason: crash [05:31:33] (03PS1) 10Marostegui: es2046: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1301880 (https://phabricator.wikimedia.org/T428993) [05:31:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2046.codfw.wmnet with reason: cloning [05:34:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool es2046', diff saved to https://phabricator.wikimedia.org/P94105 and previous config saved to /var/cache/conftool/dbconfig/20260615-053403-marostegui.json [05:35:00] 10ops-codfw, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12016980 (10Marostegui) p:05Triage→03Medium a:05Marostegui→03None [05:36:36] (03CR) 10Marostegui: [C:03+2] es2046: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1301880 (https://phabricator.wikimedia.org/T428993) (owner: 10Marostegui) [05:48:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1021: Migration to 10.11.18 T428861 [05:48:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1021: Migration to 10.11.18 T428861 [05:48:48] T428861: Compile and package MariaDB 10.11.18 - https://phabricator.wikimedia.org/T428861 [05:49:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2021: Migration to 10.11.18 T428861 
[05:49:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2021: Migration to 10.11.18 T428861 [05:56:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc2021: Migration to 10.11.18 T428861 [05:56:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:56:45] T428861: Compile and package MariaDB 10.11.18 - https://phabricator.wikimedia.org/T428861 [05:56:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:56:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2021: Migration to 10.11.18 T428861 [05:57:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2021.codfw.wmnet,pc1021.eqiad.wmnet with reason: upgrading [05:57:35] (03PS1) 10Marostegui: Revert "es2046: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1301882 [05:58:22] (03CR) 10Marostegui: [C:03+2] Revert "es2046: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1301882 (owner: 10Marostegui) [05:59:02] !log install mariadb 10.11.18 on pc1 T428861 [05:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:20] (03PS2) 10Giuseppe Lavagetto: haproxy: get ipblock map directly from HP [puppet] - 10https://gerrit.wikimedia.org/r/1299939 (https://phabricator.wikimedia.org/T422249) [06:02:20] (03PS2) 10Giuseppe Lavagetto: haproxy: use ipblocks map created by hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1299940 (https://phabricator.wikimedia.org/T422249) [06:02:20] (03PS2) 10Giuseppe Lavagetto: haproxy: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/1299941 [06:02:21] (03PS1) 10Giuseppe Lavagetto: fetch_external_clouds_vendors_nets: commit changes to provenance map [puppet] - 10https://gerrit.wikimedia.org/r/1301883 (https://phabricator.wikimedia.org/T422249) [06:03:02] 10ops-eqiad, 06DC-Ops: Inbound errors on 
interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116 (10phaultfinder) 03NEW [06:03:32] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12017044 (10MPGuy2824) https://upload.wikimedia.org/wikipedia/en/archive/1/14/20260606041817%212026_AV... [06:04:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301675 (https://phabricator.wikimedia.org/T429095) (owner: 10VadymTS1) [06:09:49] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc2021: Migration to 10.11.18 T428861 [06:09:49] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:09:54] T428861: Compile and package MariaDB 10.11.18 - https://phabricator.wikimedia.org/T428861 [06:10:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:10:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2021: Migration to 10.11.18 T428861 [06:13:51] (03PS1) 10Marostegui: control-mariadb-10.11-trixie: New version [software] - 10https://gerrit.wikimedia.org/r/1301884 (https://phabricator.wikimedia.org/T428861) [06:24:41] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-trixie: New version [software] - 10https://gerrit.wikimedia.org/r/1301884 (https://phabricator.wikimedia.org/T428861) (owner: 10Marostegui) [06:26:10] (03Merged) 10jenkins-bot: control-mariadb-10.11-trixie: New version [software] - 10https://gerrit.wikimedia.org/r/1301884 (https://phabricator.wikimedia.org/T428861) (owner: 10Marostegui) [06:27:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [06:27:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2047: 
Upgrading es2047.codfw.wmnet [06:28:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2047: Upgrading es2047.codfw.wmnet [06:31:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2047.codfw.wmnet with OS trixie [06:34:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:35:36] (03CR) 10Muehlenhoff: [C:03+2] Add component/zookeeper34 [puppet] - 10https://gerrit.wikimedia.org/r/1301364 (https://phabricator.wikimedia.org/T428495) (owner: 10Muehlenhoff) [06:45:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [06:47:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2047.codfw.wmnet with reason: host reimage [06:51:08] (03CR) 10Ayounsi: [C:03+1] "Nice! 
FYI we might also be able to get rid of the transport-in next quarter with the switch to Katran: https://phabricator.wikimedia.org/T" [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) (owner: 10Cathal Mooney) [06:53:16] !log imported zookeeper 3.4.13-6+wmf12u1 to component/zookeeper34 for bookworm-wikimedia T428495 [06:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:20] T428495: Migrate conf* hosts away from bullseye - https://phabricator.wikimedia.org/T428495 [06:55:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2047.codfw.wmnet with reason: host reimage [06:55:45] (03PS8) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [06:55:52] (03PS2) 10Arnaudb: cache_text: add gitlab-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1296572 (https://phabricator.wikimedia.org/T425441) [06:58:20] (03PS3) 10Arnaudb: cache_text: add gitlab-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1296572 (https://phabricator.wikimedia.org/T425441) [07:00:05] Amir1, urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T0700). [07:00:05] VadymTS1 and atsukoito: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] \o [07:01:02] (03CR) 10Muehlenhoff: [C:03+2] Add cumin2003 in firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1301309 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [07:04:06] hiya! 
[07:04:17] o/ [07:05:03] I can deploy [07:05:11] ok [07:05:18] VadymTS1: any objections if I deploy both config changes at once? [07:05:48] Nothing, got changes at once [07:05:59] *go [07:08:52] (03CR) 10DCausse: [C:03+1] Switch wmgUseCalendar to false for dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301675 (https://phabricator.wikimedia.org/T429095) (owner: 10VadymTS1) [07:09:05] (03CR) 10DCausse: [C:03+1] Add alias namespace for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300301 (https://phabricator.wikimedia.org/T428619) (owner: 10VadymTS1) [07:09:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301675 (https://phabricator.wikimedia.org/T429095) (owner: 10VadymTS1) [07:09:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300301 (https://phabricator.wikimedia.org/T428619) (owner: 10VadymTS1) [07:10:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:10:12] (03Merged) 10jenkins-bot: Switch wmgUseCalendar to false for dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301675 (https://phabricator.wikimedia.org/T429095) (owner: 10VadymTS1) [07:10:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:10:16] (03Merged) 10jenkins-bot: Add alias namespace for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300301 (https://phabricator.wikimedia.org/T428619) (owner: 10VadymTS1) [07:11:17] (03CR) 10Slyngshede: [V:03+2 C:03+2] Re-enable WebAuthN 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1301355 (owner: 10Slyngshede) [07:11:29] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1301675|Switch wmgUseCalendar to false for dewikivoyage (T429095)]], [[gerrit:1300301|Add alias namespace for cswiki (T428619)]] [07:11:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2047.codfw.wmnet with OS trixie [07:11:36] T429095: Undeploy Extension:Calendar-Wikivoyage on German Wikivoyage - https://phabricator.wikimedia.org/T429095 [07:11:36] T428619: Add alias namespace for cswiki - https://phabricator.wikimedia.org/T428619 [07:11:46] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:12:43] (03CR) 10Muehlenhoff: [C:03+2] Add cumin2003 to DB firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1301323 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [07:13:50] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:15:00] 06SRE: staging.webrequest.page_view.dev0 taking up most space on kafka-jumbo - https://phabricator.wikimedia.org/T429088#12017211 (10elukey) @JMonton-WMF Hi! Could you please stop the application? I am not sure which one it is :) After that we'll clear the topic to regain space. 
[07:18:29] (03CR) 10Federico Ceratto: "Yes, that's correct: the `db-switchover` script itself wants a specific dbctl/MariaDB status that is different from what this cookbook doe" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:19:07] waiting for docker-pusher to finish [07:19:47] :+1: [07:20:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:20:17] (03CR) 10Marostegui: "Yes, at the moment we are not thinking about combining both (this cookbook + db-switchover)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:20:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:21:02] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:21:03] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:21:04] hm... 
it's been running for 5+ mins [07:21:58] (03CR) 10Marostegui: "I think we have just to document this under: https://wikitech.wikimedia.org/wiki/SRE/Data_Persistence/Databases/Runbooks and this should b" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [07:22:54] (03CR) 10Arnaudb: [C:03+2] gitlab: add gitlab-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [07:23:05] !log arnaudb@dns1005 START - running authdns-update [07:23:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2047: Migration of es2047.codfw.wmnet completed [07:23:14] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:23:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:24:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:24:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:24:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1163 (T419635)', diff saved to https://phabricator.wikimedia.org/P94110 and previous config saved to /var/cache/conftool/dbconfig/20260615-072446-fceratto.json [07:24:48] !log arnaudb@dns1005 END - running authdns-update [07:24:51] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:25:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:25:53] ok image push finally done, waiting for k8s deployments [07:26:06] !log 
elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:26:45] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:27:40] (03CR) 10Arnaudb: [C:03+2] gitlab: support extra ssh host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [07:28:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12017233 (10elukey) 05Stalled→03Open @Jclark-ctr I am running a new version of the provision cookbook as a test and I see some serial-related errors for cloudvirt1078-1079-1080. The error sa... [07:28:19] jouncebot: next [07:28:19] In 2 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1000) [07:28:45] !log dcausse@deploy1003 vadymts1, dcausse: Backport for [[gerrit:1301675|Switch wmgUseCalendar to false for dewikivoyage (T429095)]], [[gerrit:1300301|Add alias namespace for cswiki (T428619)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:28:51] T429095: Undeploy Extension:Calendar-Wikivoyage on German Wikivoyage - https://phabricator.wikimedia.org/T429095 [07:28:52] T428619: Add alias namespace for cswiki - https://phabricator.wikimedia.org/T428619 [07:28:58] testing [07:29:01] thanks! 
[07:31:18] !log cwilliams@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on db-test2001.codfw.wmnet with reason: Testing [07:31:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:32:09] all good [07:33:07] VadymTS1: ok, shipping, will run namespaceDupe after the deploy [07:33:22] !log dcausse@deploy1003 vadymts1, dcausse: Continuing with deployment [07:35:33] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12017253 (10MatthewVernon) We only keep swift logs for a few days (because of the large volume of them... [07:37:37] 10SRE-swift-storage, 06Commons, 06DBA, 10MediaWiki-File-management, 10Thumbor: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12017269 (10MatthewVernon) If I'm reading the ticket correctly, this looks like a rename on s... 
[07:39:33] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:42:35] (03PS1) 10Slyngshede: P:idp-test fix webauthn database authentication [puppet] - 10https://gerrit.wikimedia.org/r/1301890 [07:42:41] 10SRE-swift-storage, 06Commons, 06DBA, 10media-backups, and 2 others: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12017274 (10jcrespo) [07:42:46] (03CR) 10Muehlenhoff: [C:03+2] Add mysql grant for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301324 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [07:43:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:44:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T419635)', diff saved to https://phabricator.wikimedia.org/P94112 and previous config saved to /var/cache/conftool/dbconfig/20260615-074417-fceratto.json [07:44:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:45:47] deployment progress: 93% almost there... 
[07:46:05] (03CR) 10Slyngshede: [C:03+2] P:idp-test fix webauthn database authentication [puppet] - 10https://gerrit.wikimedia.org/r/1301890 (owner: 10Slyngshede) [07:46:06] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1301675|Switch wmgUseCalendar to false for dewikivoyage (T429095)]], [[gerrit:1300301|Add alias namespace for cswiki (T428619)]] (duration: 34m 37s) [07:46:12] T429095: Undeploy Extension:Calendar-Wikivoyage on German Wikivoyage - https://phabricator.wikimedia.org/T429095 [07:46:12] T428619: Add alias namespace for cswiki - https://phabricator.wikimedia.org/T428619 [07:46:39] Thanks [07:46:43] VadymTS1: your changes should be live, I'll run namespaceDupes on cswiki and post the output on the task [07:47:02] atsukoito: your turn :) [07:47:23] dcausse: i'll go backport the config, thanks [07:47:27] !log dcausse@deploy1003 mwscript-k8s job started: namespaceDupes cswiki --fix # T428619 [07:48:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301373 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:49:36] (03Merged) 10jenkins-bot: translate: production opensearch on k8s endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301373 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:49:54] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1301373|translate: production opensearch on k8s endpoints (T425377)]] [07:49:59] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:52:07] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:53:47] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1301373|translate: production opensearch on k8s endpoints (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [07:54:16] dcausse: testing config and debug servers [07:54:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P94114 and previous config saved to /var/cache/conftool/dbconfig/20260615-075425-fceratto.json [07:55:04] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:55:05] atsukoito: tested the usual Special pages and seems good from my end [07:55:09] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:57:13] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [07:57:18] configs shows that the prod wikis are getting prod servers, and the `testwiki` gets test server [07:57:20] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [07:58:01] (03CR) 10Arnaudb: [C:03+2] gitlab: advertise gitlab-ssh url on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [07:58:18] (03PS5) 10Arnaudb: gitlab: advertise gitlab-ssh url on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) [07:58:42] i'm re-creating the indices with clean and proceeding [07:58:47] (03PS2) 10Arnaudb: gitlab: advertise gitlab-ssh url on gitlab primary [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) [07:58:53] atsukoito: sounds good [08:01:20] (03CR) 10Arnaudb: [C:03+2] gitlab: advertise gitlab-ssh url on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:03:34] !log atsuko@deploy1003 atsuko: Continuing with deployment [08:04:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P94115 and previous config 
saved to /var/cache/conftool/dbconfig/20260615-080432-fceratto.json [08:04:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:07:23] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:08:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2047: Migration of es2047.codfw.wmnet completed [08:08:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129 (10MoritzMuehlenhoff) 03NEW [08:08:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [08:09:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:10:49] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1301373|translate: production opensearch on k8s endpoints (T425377)]] (duration: 20m 54s) [08:10:55] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [08:14:04] dcausse: documents has started flowing into prod indices, good job! [08:14:13] atsukoito: nice! 
[08:14:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T419635)', diff saved to https://phabricator.wikimedia.org/P94117 and previous config saved to /var/cache/conftool/dbconfig/20260615-081440-fceratto.json [08:14:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:14:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [08:20:30] (03PS2) 10Muehlenhoff: Also sync firmwares to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301331 (https://phabricator.wikimedia.org/T427897) [08:20:44] (03CR) 10Muehlenhoff: [C:03+2] ganeti: Grant RAPI access to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301328 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [08:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:31:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:42] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:36:24] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:36:39] (03CR) 10Jaime Nuche: [C:03+1] releases: mask tmp.mount [puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [08:40:38] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:41:17] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:43:54] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:44:28] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:45:39] (03PS9) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [08:45:51] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [08:45:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [08:46:01] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [08:46:08] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [08:49:34] (03PS4) 10Atsuko: opensearch-ttmserver: increase memory to 1x index [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300813 (https://phabricator.wikimedia.org/T425377) [08:51:47] (03CR) 10Brouberol: opensearch-ttmserver: increase memory to 1x index (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300813 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:53:12] (03CR) 10Atsuko: [C:03+2] opensearch-ttmserver: increase memory to 1x index (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300813 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:53:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:53:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2037: 
Upgrading es2037.codfw.wmnet [08:53:32] (03CR) 10Jelto: [C:03+1] ssh-client-config: add gitlab-ssh.wikimedia.org [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1300101 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:53:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2037: Upgrading es2037.codfw.wmnet [08:55:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2037.codfw.wmnet with OS trixie [08:55:34] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [08:55:43] (03PS5) 10Filippo Giunchedi: icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) [08:56:08] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2037.codfw.wmnet with OS trixie [08:56:54] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [08:57:58] (03PS4) 10Filippo Giunchedi: toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) [08:57:58] (03PS4) 10Filippo Giunchedi: Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) [08:58:10] (03PS2) 10Jcrespo: admin: Add new systemctl alias and update $? output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 [08:59:08] marostegui@cumin1003 major-upgrade (PID 4010065) is awaiting input [08:59:28] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:59:47] (03PS3) 10Jcrespo: admin: Add new systemctl alias and update $? 
output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 [09:02:11] (03Merged) 10jenkins-bot: opensearch-ttmserver: increase memory to 1x index [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300813 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [09:03:21] jouncebot: nowandnext [09:03:21] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [09:03:21] In 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1000) [09:05:20] (03CR) 10CWilliams: [C:03+1] "screen tempted me to -1 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1169640 (owner: 10Jcrespo) [09:06:17] (03PS4) 10Jcrespo: admin: Add new systemctl alias and update $? output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 [09:06:28] (03PS5) 10Jcrespo: admin: Add new systemctl alias and update $? output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 [09:07:44] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:07:50] (03CR) 10Filippo Giunchedi: [C:03+2] toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [09:08:15] (03CR) 10Filippo Giunchedi: [C:03+2] Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [09:08:34] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-bw27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:10:36] (03PS1) 10Elukey: sre.hosts.reimage: use datetime with timezone-aware objects [cookbooks] - 10https://gerrit.wikimedia.org/r/1302088 (https://phabricator.wikimedia.org/T429125) [09:10:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:10:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Core: asw1-bw27-esams:et-0/0/50 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:12:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [09:12:12] (03CR) 10Elukey: [C:03+1] Also sync firmwares to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301331 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [09:12:24] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:12:55] (03PS1) 10Atsuko: Revert "opensearch-ttmserver: increase memory to 1x index" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302089 [09:13:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:13:46] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:14:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2037.codfw.wmnet with OS trixie [09:15:45] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:17:05] 06SRE, 
10SRE-Access-Requests: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037#12017693 (10Michael) >>! In T428037#11999846, @Raine wrote: >>>! In T428037#11999819, @Michael wrote: >> Thank you @Raine, I can confirm that I can use the new key to co... [09:17:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:20:15] (03PS1) 10Brouberol: cache/text: set caching to pass for kafka.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1302093 (https://phabricator.wikimedia.org/T428053) [09:20:55] (03CR) 10Elukey: [C:03+1] cache/text: set caching to pass for kafka.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1302093 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:22:54] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:22:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:24:04] (03PS20) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [09:24:52] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148 (10mfossati) 03NEW [09:25:53] (03CR) 10Brouberol: [C:03+2] cache/text: set caching to pass for kafka.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1302093 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:26:20] (03PS2) 10Elukey: Use datetime with timezone-aware objects [cookbooks] - 10https://gerrit.wikimedia.org/r/1302088 (https://phabricator.wikimedia.org/T429125) [09:29:55] hey, who here already closed a wiki and can double check I didn't miss anything in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1301341 ? 
[09:30:33] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12017771 (10mfossati) [09:30:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2037.codfw.wmnet with reason: host reimage [09:30:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:31:22] claime: lads.groups is probably who you want [09:31:41] p858snake|cloud: yeah I know but probably not awake rn :D [09:32:25] (03PS1) 10CWilliams: T429114: sre.mysql.depool failing with downtime for parsercache [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) [09:32:30] theres a guide on wikitech iirc [09:32:57] p858snake|cloud: yep, that's what I followed, but I'd be more comfortable with a sanity check :D It can wait a couple hours it's fine [09:32:59] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:34:21] (03CR) 10Marostegui: [C:03+1] T429114: sre.mysql.depool failing with downtime for parsercache [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) (owner: 10CWilliams) [09:35:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2037.codfw.wmnet with reason: host reimage [09:35:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:36:56] (03CR) 10Gmodena: Added DNS entries for the new WDQS 2 deployments in DSE K8s. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [09:40:35] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-k8s # T425377: populating translation memory (ttmserver-export.php) on eqiad-k8s (dblist: https://phabricator.wikimedia.org/P94120) [09:40:40] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [09:42:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:43:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:43:15] (03PS1) 10Sergio Gimeno: migrateMentorStatusAway: ensure validateStrictly receives objects [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) [09:43:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [09:43:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300835 (https://phabricator.wikimedia.org/T365889) (owner: 10Sergio Gimeno) [09:44:04] (03CR) 10Dreamy Jazz: [C:04-1] "`wmgUseAbuseFilter` needs to list `apiportalwiki` as having it explictly enabled as it's `abuse_filter_log` table has entries per 
https://" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [09:44:57] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [09:45:40] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [09:45:41] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [09:46:17] (03PS1) 10Jelto: aptrepo: update gitlab and gitlab-runner to version 19.0 [puppet] - 10https://gerrit.wikimedia.org/r/1302101 (https://phabricator.wikimedia.org/T426164) [09:46:24] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [09:47:09] (03CR) 10Dreamy Jazz: Close API Portal wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [09:48:00] (03CR) 10Dreamy Jazz: "Actually those instructions are out of date. I'll update them" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [09:48:18] Dreamy_Jazz: Appreciate the help <3 [09:49:38] if there are no objections, i plan to use the infra window to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1301338 [09:50:11] (03CR) 10FNegri: "Thanks @marostegui@wikimedia.org. 
I was interested to know if you ever used this option in prod or if you know of any gotchas when using i" [puppet] - 10https://gerrit.wikimedia.org/r/1298835 (https://phabricator.wikimedia.org/T409857) (owner: 10FNegri) [09:51:18] Np :D [09:51:30] (03CR) 10Dreamy Jazz: [C:03+1] Close API Portal wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [09:54:04] (03CR) 10Cathal Mooney: [C:03+2] Nokia: enable DHCP relay and IPv6 RAs on all IRB sub-ints [homer/public] - 10https://gerrit.wikimedia.org/r/1301359 (https://phabricator.wikimedia.org/T428908) (owner: 10Cathal Mooney) [09:54:31] (03CR) 10Dreamy Jazz: [C:03+1] "Not sure if we still want the `groupAdd` and `groupRemove` entries:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [09:54:33] (03CR) 10Muehlenhoff: [C:03+2] Also sync firmwares to cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1301331 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [09:54:51] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:56:23] (03CR) 10Muehlenhoff: [C:03+2] ssh-client-config: add gitlab-ssh.wikimedia.org [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1300101 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:56:25] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] ssh-client-config: add gitlab-ssh.wikimedia.org [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1300101 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:56:32] (03CR) 10Marostegui: [C:03+1] "We do not use it in production." 
[puppet] - 10https://gerrit.wikimedia.org/r/1298835 (https://phabricator.wikimedia.org/T409857) (owner: 10FNegri) [09:56:41] (03Merged) 10jenkins-bot: Nokia: enable DHCP relay and IPv6 RAs on all IRB sub-ints [homer/public] - 10https://gerrit.wikimedia.org/r/1301359 (https://phabricator.wikimedia.org/T428908) (owner: 10Cathal Mooney) [09:57:34] (03CR) 10CI reject: [V:04-1] migrateMentorStatusAway: ensure validateStrictly receives objects [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [09:58:12] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302105 [09:58:20] (03CR) 10Elukey: [C:03+2] "Manuel tested reimage and everything went fine, I am inclined to merge to unblock other folks. Please ping me if you see anything weird!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302088 (https://phabricator.wikimedia.org/T429125) (owner: 10Elukey) [09:58:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2037.codfw.wmnet with OS trixie [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1000) [10:00:52] (03CR) 10Clément Goubert: "I don't know either, but there will be some more cleanup once all traffic is redirected away from `apiportalwiki` by https://phabricator.w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [10:01:25] bjensen: go for it [10:01:37] I'll do my wiki closure afterwards [10:01:54] (03CR) 10Blake: [C:03+2] mediawiki: Use utf-8 for text/plain and text/html. 
[puppet] - 10https://gerrit.wikimedia.org/r/1301338 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [10:02:26] (03PS1) 10Filippo Giunchedi: toolforge: remove toolschecker from legacy redirector [puppet] - 10https://gerrit.wikimedia.org/r/1302110 (https://phabricator.wikimedia.org/T313030) [10:04:16] (03CR) 10CWilliams: [C:03+2] T429114: sre.mysql.depool failing with downtime for parsercache [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) (owner: 10CWilliams) [10:04:39] FIRING: SystemdUnitFailed: stunnel4.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:54] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:07:08] (03CR) 10Gmodena: Added DNS entries for the new WDQS 2 deployments in DSE K8s. (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [10:08:20] (03PS2) 10CWilliams: T429114: sre.mysql.depool failing with downtime for parsercache [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) [10:08:37] (03PS1) 10Clément Goubert: redirects.dat: Funnel api.w.o to mw.o/wiki/Wikimedia_APIs [puppet] - 10https://gerrit.wikimedia.org/r/1302106 (https://phabricator.wikimedia.org/T418492) [10:08:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2037: repool after upgrade [10:10:37] !log blake@deploy1003 Started scap sync-world: apache config change (T428772) [10:10:42] T428772: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772 [10:11:29] !log blake@deploy1003 blake: apache config change (T428772) synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:11:48] (03CR) 10Dreamy Jazz: [C:03+1] "Probably should be fine to leave for now in case stewards need to do rights changes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [10:12:18] !log blake@deploy1003 blake: Continuing with deployment [10:12:39] (03CR) 10CWilliams: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) (owner: 10CWilliams) [10:12:44] (03PS1) 10Federico Ceratto: filtered_tables.txt: drop il_to column from imagelinks table [puppet] - 10https://gerrit.wikimedia.org/r/1302113 (https://phabricator.wikimedia.org/T419635) [10:13:22] (03CR) 10Zabe: [C:03+1] filtered_tables.txt: drop il_to column from imagelinks table [puppet] - 10https://gerrit.wikimedia.org/r/1302113 (https://phabricator.wikimedia.org/T419635) (owner: 10Federico Ceratto) [10:13:33] (03CR) 10Marostegui: [C:03+1] filtered_tables.txt: drop il_to column from imagelinks table [puppet] - 10https://gerrit.wikimedia.org/r/1302113 (https://phabricator.wikimedia.org/T419635) (owner: 10Federico Ceratto) [10:16:14] (03PS3) 10Daniel Kinzler: smokepy: Add interactive pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301423 (https://phabricator.wikimedia.org/T424825) [10:16:18] !log blake@deploy1003 Finished scap sync-world: apache config change (T428772) (duration: 06m 41s) [10:16:22] T428772: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772 [10:16:37] claime: i'm done, thanks! [10:18:16] bjensen: all good? 
[10:18:16] (03CR) 10Muehlenhoff: [C:03+2] Retire the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294284 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:19:11] claime: yup, httpbb tests all passed, scap was happy, my browser loads the file with unicode-containing names in an unmangled way [10:19:29] bjensen: awesome, good job [10:19:45] (03Merged) 10jenkins-bot: T429114: sre.mysql.depool failing with downtime for parsercache [cookbooks] - 10https://gerrit.wikimedia.org/r/1302098 (https://phabricator.wikimedia.org/T429114) (owner: 10CWilliams) [10:21:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [10:23:16] (03Merged) 10jenkins-bot: Close API Portal wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301341 (https://phabricator.wikimedia.org/T427537) (owner: 10Clément Goubert) [10:23:33] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1301341|Close API Portal wiki (T427537)]] [10:23:38] T427537: Close the API Portal wiki - https://phabricator.wikimedia.org/T427537 [10:24:22] (03PS2) 10Hnowlan: sre: Add sre.metamonitoring.downtime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) [10:24:51] (03PS1) 10Filippo Giunchedi: openstack: deprecate ensure_running_kvm_instances check [puppet] - 10https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) [10:25:06] (03PS2) 10Daniel Kinzler: smokepy: use live mount for test files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302104 (https://phabricator.wikimedia.org/T424825) [10:25:20] !log cgoubert@deploy1003 cgoubert: Backport for [[gerrit:1301341|Close API Portal wiki (T427537)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:25:51] (03PS3) 10Daniel Kinzler: smokepy tests: share helm-test pod via vendor module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298850 [10:26:02] (03PS8) 10Daniel Kinzler: rest-gateway: run smokepy tests via helm test (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297668 (https://phabricator.wikimedia.org/T424825) [10:26:15] (03CR) 10Hnowlan: sre: Add sre.metamonitoring.downtime cookbook (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [10:26:32] !log cgoubert@deploy1003 cgoubert: Continuing with deployment [10:27:34] RESOLVED: SystemdUnitFailed: stunnel4.service on cumin1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:36] (03CR) 10Arnaudb: [C:03+1] "lgtm, good luck with the upgrade!" [puppet] - 10https://gerrit.wikimedia.org/r/1302101 (https://phabricator.wikimedia.org/T426164) (owner: 10Jelto) [10:28:09] (03CR) 10FNegri: "> if opens a transaction, does some non-DB work (API call, computation) for >60s before the next query will get the connection killed. The" [puppet] - 10https://gerrit.wikimedia.org/r/1298835 (https://phabricator.wikimedia.org/T409857) (owner: 10FNegri) [10:29:17] (03PS1) 10Muehlenhoff: Absent the NRPE mirror check [puppet] - 10https://gerrit.wikimedia.org/r/1302115 (https://phabricator.wikimedia.org/T416707) [10:29:18] (03CR) 10Federico Ceratto: [C:03+2] filtered_tables.txt: drop il_to column from imagelinks table [puppet] - 10https://gerrit.wikimedia.org/r/1302113 (https://phabricator.wikimedia.org/T419635) (owner: 10Federico Ceratto) [10:30:19] (03CR) 10Jelto: "That would be a great addition. 
However a few thoughts:" [alerts] - 10https://gerrit.wikimedia.org/r/1301233 (https://phabricator.wikimedia.org/T428979) (owner: 10Arnaudb) [10:30:50] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1301341|Close API Portal wiki (T427537)]] (duration: 07m 16s) [10:30:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1302101 (https://phabricator.wikimedia.org/T426164) (owner: 10Jelto) [10:30:55] T427537: Close the API Portal wiki - https://phabricator.wikimedia.org/T427537 [10:32:06] (03CR) 10Muehlenhoff: [C:03+2] Absent the NRPE mirror check [puppet] - 10https://gerrit.wikimedia.org/r/1302115 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:33:42] (03CR) 10Trueg: Added DNS entries for the new WDQS 2 deployments in DSE K8s. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [10:34:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:40:11] (03PS1) 10Muehlenhoff: Remove the remaining bits of the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1302117 (https://phabricator.wikimedia.org/T416707) [10:44:51] (03CR) 10Muehlenhoff: [C:03+2] Remove the remaining bits of the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1302117 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:45:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:51:19] (03PS2) 10Clément Goubert: [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [10:52:37] (03CR) 10CI reject: [V:04-1] [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [10:52:49] !log installing openssl security updates on bookworm [10:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2037: repool after upgrade [10:54:15] (03PS3) 10Clément Goubert: [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [10:54:31] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [10:55:05] (03CR) 10CI reject: [V:04-1] [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [10:55:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [10:58:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:58:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2036: Upgrading es2036.codfw.wmnet [10:59:07] (03PS4) 10Clément Goubert: [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) (owner: 10TChin) [10:59:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool 
es2036: Upgrading es2036.codfw.wmnet [11:00:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2036.codfw.wmnet with OS trixie [11:08:32] (03PS1) 10Marostegui: check_private_data_report: Add Ceri [puppet] - 10https://gerrit.wikimedia.org/r/1302123 [11:08:42] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [11:09:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [11:09:52] (03PS1) 10Mszwarc: Extract a service that initiates SI signal matching [extensions/CheckUser] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302124 (https://phabricator.wikimedia.org/T428557) [11:12:34] (03PS1) 10Slyngshede: P:idp allow services to require MFA [puppet] - 10https://gerrit.wikimedia.org/r/1302126 [11:12:51] (03PS1) 10Ayounsi: Makefike: don't try to install wheel*.whl [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1302127 [11:13:35] (03PS2) 10Ayounsi: Makefike: don't try to install wheel*.whl [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1302127 [11:13:50] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [11:17:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2036.codfw.wmnet with reason: host reimage [11:18:39] (03CR) 10Jelto: [C:03+2] aptrepo: update gitlab and gitlab-runner to version 19.0 [puppet] - 10https://gerrit.wikimedia.org/r/1302101 (https://phabricator.wikimedia.org/T426164) (owner: 10Jelto) [11:19:44] (03CR) 10Jcrespo: [C:03+2] admin: Add new systemctl alias and update $? 
output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 (owner: 10Jcrespo) [11:21:34] (03PS2) 10Slyngshede: P:idp allow services to require MFA [puppet] - 10https://gerrit.wikimedia.org/r/1302126 [11:24:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2036.codfw.wmnet with reason: host reimage [11:26:59] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [11:27:37] 06SRE, 06Product Safety and Integrity: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached - https://phabricator.wikimedia.org/T429156#12018165 (10kostajh) [11:28:09] 06SRE, 06Product Safety and Integrity: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached - https://phabricator.wikimedia.org/T429156#12018171 (10kostajh) [11:30:05] (03CR) 10Elukey: [C:03+1] cache::haproxy: log txn.provenance variable for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1301432 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [11:37:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:38:54] (03CR) 10Slyngshede: "Enables webauthn on Puppetboard (test)" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [11:41:01] elukey@cumin1003 provision (PID 4076610) is awaiting input [11:42:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2036.codfw.wmnet with OS trixie [11:43:00] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-k8s # T425377: populating translation memory (ttmserver-export.php) on eqiad-k8s (dblist: https://phabricator.wikimedia.org/P94127) [11:43:03] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:43:05] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [11:44:45] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver codfw-k8s # T425377: populating translation memory (ttmserver-export.php) on codfw-k8s (dblist: https://phabricator.wikimedia.org/P94128) [11:45:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:45:47] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1251: Upgrading db1251.eqiad.wmnet [11:46:28] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1251: Upgrading db1251.eqiad.wmnet [11:48:05] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1251.eqiad.wmnet with OS trixie [11:49:25] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:49:46] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2216: Upgrading db2216.codfw.wmnet [11:50:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2216: Upgrading db2216.codfw.wmnet [11:53:20] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2216.codfw.wmnet with OS trixie [11:54:06] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:54:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2036: Migration of es2036.codfw.wmnet completed [11:55:59] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [11:56:46] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [11:58:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - 
https://phabricator.wikimedia.org/T425088#12018255 (10Jclark-ctr) Set all BMCs to DHCP, so they should now pick up the correct IPs. Verified the MAC addresses match in the BMC, NetBox, and on the server stickers. It's possible some of... [12:02:57] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage [12:03:05] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26): Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018288 (10isarantopoulos) [12:04:29] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26): Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018298 (10isarantopoulos) I approve [12:05:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:06:08] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:06:36] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:06:55] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [12:09:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: host reimage [12:10:51] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [12:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:11:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) (owner: 10Arlolra) [12:11:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [12:12:41] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2216.codfw.wmnet with reason: host reimage [12:15:25] (03CR) 10JMeybohm: docker_registry: refactor the nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [12:15:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:18:13] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [12:18:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: host reimage [12:19:02] (03CR) 10CI reject: [V:04-1] Localisation updates from 
https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1302135 (owner: 10L10n-bot) [12:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:46] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [12:23:43] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12018393 (10Jclark-ctr) a:03Jclark-ctr I believe this is the link causing the error: https://netbox.wikimedia.org/dcim/cables/5727/ @cmooney @ayounsi — can you confirm? If... [12:23:51] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:24:27] jouncebot: nowandnext [12:24:27] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [12:24:27] In 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1300) [12:25:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302124 (https://phabricator.wikimedia.org/T428557) (owner: 10Mszwarc) [12:25:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302125 (https://phabricator.wikimedia.org/T428557) (owner: 10Mszwarc) [12:26:52] (03CR) 10Elukey: docker_registry: refactor the nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [12:26:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1251.eqiad.wmnet with OS trixie [12:27:29] 
!log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:29:11] (03PS4) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [12:30:35] (03PS5) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [12:30:38] (03Merged) 10jenkins-bot: Extract a service that initiates SI signal matching [extensions/CheckUser] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302124 (https://phabricator.wikimedia.org/T428557) (owner: 10Mszwarc) [12:30:40] (03Merged) 10jenkins-bot: Trigger Suggested Investigations when client hints are saved [extensions/CheckUser] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302125 (https://phabricator.wikimedia.org/T428557) (owner: 10Mszwarc) [12:31:00] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1302124|Extract a service that initiates SI signal matching (T428557)]], [[gerrit:1302125|Trigger Suggested Investigations when client hints are saved (T428557)]] [12:31:14] (03PS6) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [12:31:40] (03CR) 10Elukey: docker_registry: refactor the nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [12:31:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:48] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1302124|Extract a 
service that initiates SI signal matching (T428557)]], [[gerrit:1302125|Trigger Suggested Investigations when client hints are saved (T428557)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:32:52] (03CR) 10Michael Große: "This might need Ia00ef48a745e21c416bc9db64705ef951d8ba976 to also be backported and then rebased on top of that." [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [12:32:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:32:59] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26): Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12018419 (10Jdrewniak) As @HSwan-WMF's delegate, I approve. [12:34:22] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [12:34:52] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:35:22] (03CR) 10Santiago Faci: [C:03+1] Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) (owner: 10Clare Ming) [12:35:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2216.codfw.wmnet with OS trixie [12:37:53] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1251: Migration of db1251.eqiad.wmnet completed [12:38:42] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302124|Extract a service that initiates SI signal matching (T428557)]], [[gerrit:1302125|Trigger Suggested Investigations when client hints are saved (T428557)]] (duration: 07m 42s) [12:38:47] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface 
cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12018437 (10cmooney) >>! In T429116#12018393, @Jclark-ctr wrote: > I believe this is the link causing the error: https://netbox.wikimedia.org/dcim/cables/5727/ > > @cmooney... [12:40:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2036: Migration of es2036.codfw.wmnet completed [12:40:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:41:44] (03PS1) 10Muehlenhoff: Fix preseed for cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1302146 [12:42:03] (03Abandoned) 10Elukey: sre.hosts.reimage: test force_http_boot_once override [cookbooks] - 10https://gerrit.wikimedia.org/r/1291876 (owner: 10Elukey) [12:42:34] FIRING: SystemdUnitFailed: requestctl-credential-refresh.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:42:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12018455 (10Volans) It seems to be a mix of things, from one side sudo is executing without bash that sets TERM, with: ` sudo cumin 'A:puppetserver' 'echo $TERM' ` you get `dumb`... [12:42:54] (03CR) 10JMeybohm: [C:03+1] "Cool. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [12:43:13] (03PS6) 10Elukey: sre.hosts.provision: introduce the wmfroot user [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) [12:43:33] (03CR) 10Elukey: "Tested with new nodes like kafka-logging and cloudvirt, all good afaics. Lemme know!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [12:43:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:44:15] (03PS1) 10Muehlenhoff: Allow cumin2003 in IRC notifications [puppet] - 10https://gerrit.wikimedia.org/r/1302147 (https://phabricator.wikimedia.org/T427897) [12:45:36] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1078.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:46:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2216: Migration of db2216.codfw.wmnet completed [12:48:40] !log installing augeas security updates [12:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:49] (03CR) 10Tiziano Fogli: [C:03+2] alertmanager-irc-relay: suppress alertname on irc [puppet] - 10https://gerrit.wikimedia.org/r/1301240 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [12:50:03] 06SRE, 06Infrastructure-Foundations, 06Release-Engineering-Team (Radar), 07User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12018526 (10Andrew) For future reference: it turned out that our osbpo mirror was one of two stores for those packcages, so we have set up a r... [12:50:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12018527 (10Jclark-ctr) @Marostegui has this been repooled yet? [12:50:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12018531 (10Marostegui) @Jclark-ctr nope, do you need anything from it? [12:51:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12018536 (10Jclark-ctr) No i am good. 
I was concerned of this being forgotten [12:54:09] (03PS1) 10Marostegui: Revert "db1262: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1302148 [12:55:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300858 (https://phabricator.wikimedia.org/T427730) (owner: 10Jforrester) [12:55:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 (owner: 10Jforrester) [12:56:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301389 (https://phabricator.wikimedia.org/T428954) (owner: 10Jforrester) [12:57:53] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:57:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1300). [13:00:05] MatmaRex, Dragoniez, Sergi0, nemo-yiannis, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] * James_F waves. 
[13:00:26] 👋 [13:00:27] o/ [13:00:28] o/ [13:01:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:02:23] jouncebot: now [13:02:23] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1300) [13:02:36] MatmaRex: Can you deploy? [13:02:44] no [13:03:04] sorry i'm late. i need someone to ship my change :) i can also reschedule it, since it looks like we're quite busy [13:03:10] I can do it. [13:03:16] But I'll run out of time due to meetings. [13:03:37] (03CR) 10Elukey: [C:03+2] docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [13:03:53] I'll do all the configs together. 
[13:04:17] (03PS2) 10Jforrester: abstractwiki: Temporary config for the automatic Abstract Article generation script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 [13:04:18] np [13:04:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [13:04:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) (owner: 10Dragoniez) [13:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) (owner: 10Arlolra) [13:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300858 (https://phabricator.wikimedia.org/T427730) (owner: 10Jforrester) [13:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 (owner: 10Jforrester) [13:04:50] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12018596 (10MoritzMuehlenhoff) [13:04:51] (03CR) 10CI reject: [V:04-1] Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [13:05:00] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:05:05] I have time until the end of the window. 
My wmf.6 change is no-op and can go along with any other [13:05:16] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:05:22] (03PS4) 10Bartosz Dziewoński: Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) [13:05:39] sergi0: Ack. [13:05:44] (03CR) 10JMeybohm: "I would test the binary on bookworm. If it works fine, copy it "down"" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: 10Jelto) [13:05:46] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12018604 (10MoritzMuehlenhoff) [13:05:52] nemo-yiannis: Is your wmf.6 backport easy to test? [13:05:56] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [13:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) (owner: 10Dragoniez) [13:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) (owner: 10Arlolra) [13:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300858 (https://phabricator.wikimedia.org/T427730) (owner: 10Jforrester) [13:05:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 (owner: 10Jforrester) [13:06:25] James_F: yes [13:06:43] Cool, I'll merge the three wmf.6 ones together too. 
[13:06:50] (03PS1) 10Filippo Giunchedi: wmcs: do not pint-warn on NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1302150 (https://phabricator.wikimedia.org/T328502) [13:06:52] (03PS1) 10Filippo Giunchedi: team-wmcs: introduce per-namespace neutron conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) [13:07:02] (03CR) 10Volans: [C:03+1] "LGTM cookbook wise, I'll leave it to o11y for the specifics" [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [13:07:09] (03PS2) 10Muehlenhoff: Fix preseed for cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1302146 [13:07:29] (03Merged) 10jenkins-bot: jawiki: remove four rights from the eliminator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) (owner: 10Dragoniez) [13:07:30] (03PS3) 10Muehlenhoff: Fix preseed for cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1302146 [13:07:33] (03Merged) 10jenkins-bot: Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1301401 (https://phabricator.wikimedia.org/T429038) (owner: 10Arlolra) [13:07:50] (03Abandoned) 10Arnaudb: cache_text: add gitlab-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1296572 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:08:05] (03Abandoned) 10Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:08:11] (03Abandoned) 10Arnaudb: lvs7003: add gitlab-ssh and gitlab-https [puppet] - 10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [13:08:16] (03PS4) 10Muehlenhoff: Remove cumin1002 from preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1302146 [13:08:17] !log trueg@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:08:42] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:10:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/50 (Core: cr2-esams:et-0/0/0 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-bw27-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:10:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:10:42] * James_F sighs at CI being slow. [13:10:45] Perfect timing. [13:10:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Core: asw1-bw27-esams:et-0/0/50 {#30369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:10:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:10:58] !log installing Linux 6.1.174 on Bookworm hosts [13:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:24] (03Merged) 10jenkins-bot: [abstractwiki] Set wgForceUIMsgAsContentMsg for sidebar messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300858 (https://phabricator.wikimedia.org/T427730) (owner: 10Jforrester) [13:11:28] (03Merged) 10jenkins-bot: abstractwiki: Temporary config for the automatic Abstract Article generation script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 (owner: 10Jforrester) [13:11:36] (03Merged) 
10jenkins-bot: Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [13:11:56] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1293173|Configure wgOAuthAutoApprove['protocols'] (T412542 T426614)]], [[gerrit:1300873|jawiki: remove four rights from the eliminator group (T428942)]], [[gerrit:1301401|Deploy PRV to 6 wikis (T429038)]], [[gerrit:1300858|[abstractwiki] Set wgForceUIMsgAsContentMsg for sidebar messages (T427730)]], [[gerrit:1300872|abstractwiki: Temporary config fo [13:11:56] r the automatic Abstract Article generation script]] [13:12:09] T412542: Rethink protocol support for OAuth apps - https://phabricator.wikimedia.org/T412542 [13:12:10] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [13:12:11] T428942: Changing eliminator settings on jawiki - https://phabricator.wikimedia.org/T428942 [13:12:11] T429038: Parsoid Read Views to deploy ~2026-06-15 - https://phabricator.wikimedia.org/T429038 [13:12:11] (03CR) 10Tiziano Fogli: [C:03+2] netops/iface/saturation: strip cableid [alerts] - 10https://gerrit.wikimedia.org/r/1301236 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [13:12:12] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:12:13] T427730: Make MediaWiki:wikilambda-abstractwiki-sidebar-projectchat and MediaWiki:wikilambda-abstractwiki-sidebar-createarticle translatable - https://phabricator.wikimedia.org/T427730 [13:12:21] (03CR) 10Tiziano Fogli: [C:03+2] netops/iface/saturation: suppress alertname on irc [alerts] - 10https://gerrit.wikimedia.org/r/1301241 (https://phabricator.wikimedia.org/T424794) (owner: 10Tiziano Fogli) [13:12:44] (03CR) 10Jforrester: [C:03+2] 
CacheTesterResultsJob: Re-hydrate stashedResult to stdClass [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301389 (https://phabricator.wikimedia.org/T428954) (owner: 10Jforrester) [13:12:52] (03CR) 10Jforrester: [C:03+2] Store nowiki source in StripState::extra to support subst-nowiki [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [13:13:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:13:20] (03CR) 10Muehlenhoff: [C:03+2] Remove cumin1002 from preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1302146 (owner: 10Muehlenhoff) [13:13:25] sergi0: 1302100 fails CI; MichaelG_WMF says you might also need 1300107 backported? Can you look at that? [13:13:42] !log jforrester@deploy1003 arlolra, matmarex, jforrester, dragoniez: Backport for [[gerrit:1293173|Configure wgOAuthAutoApprove['protocols'] (T412542 T426614)]], [[gerrit:1300873|jawiki: remove four rights from the eliminator group (T428942)]], [[gerrit:1301401|Deploy PRV to 6 wikis (T429038)]], [[gerrit:1300858|[abstractwiki] Set wgForceUIMsgAsContentMsg for sidebar messages (T427730)]], [[gerrit:1300872|abstractwiki: Te [13:13:42] mporary config for the automatic Abstract Article generation script]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:13:59] (03CR) 10Jforrester: [C:03+2] Remove no longer used product_metrics.homepage_module_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300835 (https://phabricator.wikimedia.org/T365889) (owner: 10Sergio Gimeno) [13:14:29] Everyone please check mw-debug. 
[13:14:29] (03PS10) 10Arnaudb: trafficserver: add a map for gitlab instances as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [13:15:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1300748 (owner: 10Effie Mouzeli) [13:15:19] oh sorry about that, on it [13:15:25] (03PS1) 10Sergio Gimeno: TaskSuggester: avoid nullable logger in setLogger call [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302153 [13:15:57] (03PS2) 10Sergio Gimeno: migrateMentorStatusAway: ensure validateStrictly receives objects [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) [13:16:16] James_F: Looking good for my patch, TY [13:16:35] still testing, one sec [13:16:38] Ack. [13:17:04] nemo-yiannis: Is the PRV change OK? [13:17:30] (looks good) [13:17:45] yes just verified it [13:17:48] Cool. [13:17:50] !log jforrester@deploy1003 arlolra, matmarex, jforrester, dragoniez: Continuing with deployment [13:17:52] plwiki is rendered by default with parsoid [13:17:56] (03Merged) 10jenkins-bot: Remove no longer used product_metrics.homepage_module_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300835 (https://phabricator.wikimedia.org/T365889) (owner: 10Sergio Gimeno) [13:18:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1079.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:18:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:20:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:20:14] nemo-yiannis: Sadly your patch has suffered flaky tests. :-( Will re-trigger it. 
[13:20:25] (03Merged) 10jenkins-bot: CacheTesterResultsJob: Re-hydrate stashedResult to stdClass [extensions/WikiLambda] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301389 (https://phabricator.wikimedia.org/T428954) (owner: 10Jforrester) [13:20:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:21:55] (03PS1) 10Giuseppe Lavagetto: Several changes: [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1302157 [13:22:06] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293173|Configure wgOAuthAutoApprove['protocols'] (T412542 T426614)]], [[gerrit:1300873|jawiki: remove four rights from the eliminator group (T428942)]], [[gerrit:1301401|Deploy PRV to 6 wikis (T429038)]], [[gerrit:1300858|[abstractwiki] Set wgForceUIMsgAsContentMsg for sidebar messages (T427730)]], [[gerrit:1300872|abstractwiki: Temporary config f [13:22:06] or the automatic Abstract Article generation script]] (duration: 10m 10s) [13:22:16] T412542: Rethink protocol support for OAuth apps - https://phabricator.wikimedia.org/T412542 [13:22:17] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [13:22:17] T428942: Changing eliminator settings on jawiki - https://phabricator.wikimedia.org/T428942 [13:22:17] T429038: Parsoid Read Views to deploy ~2026-06-15 - https://phabricator.wikimedia.org/T429038 [13:22:18] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Several changes: [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1302157 (owner: 10Giuseppe Lavagetto) [13:22:19] T427730: Make MediaWiki:wikilambda-abstractwiki-sidebar-projectchat and MediaWiki:wikilambda-abstractwiki-sidebar-createarticle translatable - https://phabricator.wikimedia.org/T427730 [13:22:21] (03CR) 10CI reject: [V:04-1] Store nowiki source in StripState::extra to support subst-nowiki [core] (wmf/1.47.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [13:22:33] (03CR) 10Jforrester: [C:03+2] "…" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [13:23:01] (03CR) 10Jforrester: [C:03+2] TaskSuggester: avoid nullable logger in setLogger call [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302153 (owner: 10Sergio Gimeno) [13:23:04] (03CR) 10Jforrester: [C:03+2] migrateMentorStatusAway: ensure validateStrictly receives objects [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [13:23:23] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1251: Migration of db1251.eqiad.wmnet completed [13:23:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:24:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302153 (owner: 10Sergio Gimeno) [13:24:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [13:24:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [13:24:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes - oblivian@cumin1003" [13:25:00] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code 
hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes - oblivian@cumin1003 [13:25:16] !log cr2-esams, reconfigure chassis fpc to set port 0 to 100G T427056 [13:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:58] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Haproxy provenance maps in HP; UX changes - oblivian@cumin1003 [13:25:59] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Haproxy provenance maps in HP; UX changes - oblivian@cumin1003" [13:27:32] 06SRE, 06Infrastructure-Foundations, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175 (10ssingh) 03NEW [13:27:46] 06SRE, 06Infrastructure-Foundations, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12018820 (10ssingh) p:05Triage→03Medium [13:28:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1080.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:28:33] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-esams,cr2-esams IPv6 with reason: bouncing pic0 to reconfigure port speeds [13:29:47] !log enable BGP graceful-shutdown sender on cr2-esams to drain traffic T427056 [13:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:47] (03CR) 10Hnowlan: [C:03+2] thumbor: change readiness probes to make surge recovery safer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298811 (https://phabricator.wikimedia.org/T357145) (owner: 10Hnowlan) [13:32:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2216: Migration of db2216.codfw.wmnet 
completed [13:32:07] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:33:01] (03PS1) 10Elukey: sre.hosts.reimage: introduce wmfroot [cookbooks] - 10https://gerrit.wikimedia.org/r/1302160 (https://phabricator.wikimedia.org/T426180) [13:33:04] James_F: Are you waiting for the CI to pass ? [13:33:16] Yeah. :-( [13:33:20] ok [13:33:28] You can watch along at https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/40866/console [13:34:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie [13:34:14] (03Merged) 10jenkins-bot: thumbor: change readiness probes to make surge recovery safer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298811 (https://phabricator.wikimedia.org/T357145) (owner: 10Hnowlan) [13:34:58] (03Merged) 10jenkins-bot: Store nowiki source in StripState::extra to support subst-nowiki [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1301451 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [13:35:03] (03Merged) 10jenkins-bot: TaskSuggester: avoid nullable logger in setLogger call [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302153 (owner: 10Sergio Gimeno) [13:35:04] (03Merged) 10jenkins-bot: migrateMentorStatusAway: ensure validateStrictly receives objects [extensions/GrowthExperiments] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302100 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [13:35:26] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1300835|Remove no longer used product_metrics.homepage_module_interaction (T365889 T426742)]], [[gerrit:1302153|TaskSuggester: avoid nullable logger in setLogger call]], [[gerrit:1302100|migrateMentorStatusAway: ensure validateStrictly receives objects (T409170)]], [[gerrit:1301451|Store nowiki source in StripState::extra to support subst-nowiki (T3 
[13:35:26] 98967)]], [[gerrit:1301389|CacheTesterResultsJob: Re-hydrate stashedResult to stdClass (T428954)]] [13:35:38] T365889: [EPIC] Community updates module: instrumentation & measurement (SDS 2.1.3) - https://phabricator.wikimedia.org/T365889 [13:35:39] T426742: Remove client side analytics logging from Special:Homepage and modules - https://phabricator.wikimedia.org/T426742 [13:35:40] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [13:35:40] T428954: TypeError: MediaWiki\Extension\WikiLambda\Jobs\CacheTesterResultsJob::storeTestResult(): Argument #10 ($stashedResult) must be of type stdClass, array given, called in /srv/mediawiki/php-1.47.0-wmf.6/extensions/WikiLambda/inclu - https://phabricator.wikimedia.org/T428954 [13:37:13] !log jforrester@deploy1003 arlolra, sgimeno, jforrester: Backport for [[gerrit:1300835|Remove no longer used product_metrics.homepage_module_interaction (T365889 T426742)]], [[gerrit:1302153|TaskSuggester: avoid nullable logger in setLogger call]], [[gerrit:1302100|migrateMentorStatusAway: ensure validateStrictly receives objects (T409170)]], [[gerrit:1301451|Store nowiki source in StripState::extra to support subst-nowik [13:37:13] i (T398967)]], [[gerrit:1301389|CacheTesterResultsJob: Re-hydrate stashedResult to stdClass (T428954)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:37:21] T398967: Parsoid doesn't process Template:Markup correctly on cbk_zamwiki - https://phabricator.wikimedia.org/T398967 [13:37:30] nemo-yiannis: Can you verify? [13:37:39] on it [13:37:54] sergi0: Ditto. 
:-) [13:38:09] works [13:38:13] thanks James_F [13:38:24] lgtm from my end [13:39:07] !log jforrester@deploy1003 arlolra, sgimeno, jforrester: Continuing with deployment [13:40:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:41:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12018955 (10Marostegui) Yeah, no worries. Will be pooled tomorrow or later today. Thanks - will close the task once repooled. [13:42:16] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:43:01] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12018960 (10elukey) @Jhancock.wm Hi! I am testing a new version of the provision cookbook to unblock this use case, do you mind to send to me the... 
[13:43:22] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300835|Remove no longer used product_metrics.homepage_module_interaction (T365889 T426742)]], [[gerrit:1302153|TaskSuggester: avoid nullable logger in setLogger call]], [[gerrit:1302100|migrateMentorStatusAway: ensure validateStrictly receives objects (T409170)]], [[gerrit:1301451|Store nowiki source in StripState::extra to support subst-nowiki (T [13:43:22] 398967)]], [[gerrit:1301389|CacheTesterResultsJob: Re-hydrate stashedResult to stdClass (T428954)]] (duration: 07m 56s) [13:43:30] T365889: [EPIC] Community updates module: instrumentation & measurement (SDS 2.1.3) - https://phabricator.wikimedia.org/T365889 [13:43:30] T426742: Remove client side analytics logging from Special:Homepage and modules - https://phabricator.wikimedia.org/T426742 [13:43:31] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [13:43:31] T428954: TypeError: MediaWiki\Extension\WikiLambda\Jobs\CacheTesterResultsJob::storeTestResult(): Argument #10 ($stashedResult) must be of type stdClass, array given, called in /srv/mediawiki/php-1.47.0-wmf.6/extensions/WikiLambda/inclu - https://phabricator.wikimedia.org/T428954 [13:43:51] All done. [13:44:04] thanks [13:44:44] FIRING: [2x] SystemdUnitFailed: requestctl-credential-refresh.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:51] (03PS2) 10Filippo Giunchedi: team-wmcs: do not pint-warn on NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1302150 (https://phabricator.wikimedia.org/T328502) [13:45:51] (03PS2) 10Filippo Giunchedi: team-wmcs: introduce per-namespace neutron conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) [13:46:21] thank you @James_F ! 
[13:47:07] (03CR) 10Majavah: [C:04-1] team-wmcs: introduce per-namespace neutron conntrack alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:49:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Reboots T426633 [13:49:28] (03CR) 10Ssingh: "Looks good; should we also consider updating modules/profile/files/trafficserver/multi-dc_test.lua?" [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [13:49:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:49:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 11 hosts with reason: Reboots T426633 [13:50:29] (03CR) 10Majavah: [C:03+1] team-wmcs: do not pint-warn on NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1302150 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:51:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Reboots T426633 [13:51:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Reboots T426633 [13:51:50] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12019027 (10elukey) [13:52:34] FIRING: [3x] SystemdUnitFailed: requestctl-credential-refresh.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:50] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our 
World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12019043 (10Doc_James) Thanks Aklapper owid@lists.wikimedia.org for updates regarding improvements in OWID visualizations needing implementation by interface admins in various langua... [13:53:01] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12019045 (10elukey) [13:53:02] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12019046 (10elukey) [13:53:55] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver codfw-k8s # T425377: populating translation memory (ttmserver-export.php) on codfw-k8s (dblist: https://phabricator.wikimedia.org/P94145) [13:53:59] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [13:54:05] (03CR) 10Ssingh: "Yes that makes sense. How about we just put the specific CSP in hiera and then leave reporting-endpoint in VCL? 
It's not the most ideal bu" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [13:54:12] !log doing a quick restart of sanitarium hosts db1155 and db1154 [13:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) (owner: 10BPirkle) [13:55:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12019069 (10elukey) I tried to reimage and got "No root file system" when partitioning in d-i. This is the current dev list from the d-i's shell: ` ~ # ls /dev/sd* /dev/sd... [13:56:17] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1006.eqiad.wmnet with OS trixie [13:56:45] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1077.eqiad.wmnet with OS trixie [13:57:18] (03PS2) 10Filippo Giunchedi: openstack: deprecate ensure_running_kvm_instances check [puppet] - 10https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) [13:57:18] (03PS1) 10Filippo Giunchedi: prometheus: remove 'key' label from neutron_netns metrics [puppet] - 10https://gerrit.wikimedia.org/r/1302164 (https://phabricator.wikimedia.org/T328502) [13:57:34] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:29] (03CR) 10Majavah: [C:03+1] prometheus: remove 'key' label from neutron_netns metrics [puppet] - 
10https://gerrit.wikimedia.org/r/1302164 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:58:54] (03CR) 10Majavah: openstack: deprecate ensure_running_kvm_instances check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [14:00:06] (03CR) 10CDanis: "How about `sess` instead? Would save us re-computing it on every request." [puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [14:00:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:00:28] (03PS1) 10Giuseppe Lavagetto: Revert "Several changes:" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1302166 [14:00:36] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert "Several changes:" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1302166 (owner: 10Giuseppe Lavagetto) [14:00:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1196: Upgrading db1196.eqiad.wmnet [14:01:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1196: Upgrading db1196.eqiad.wmnet [14:01:36] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [14:01:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:01:54] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma 
deployment to the alerting hosts with reason: "revert deployment - oblivian@cumin1003" [14:01:55] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: revert deployment - oblivian@cumin1003 [14:02:48] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: revert deployment - oblivian@cumin1003 [14:02:49] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "revert deployment - oblivian@cumin1003" [14:03:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1196.eqiad.wmnet with OS trixie [14:04:18] jouncebot: nowandnext [14:04:19] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [14:04:19] In 0 hour(s) and 25 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1430) [14:04:58] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [14:05:05] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:05:24] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:06:45] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:07:38] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage [14:07:41] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1077.eqiad.wmnet with reason: host reimage [14:07:57] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:08:29] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:10:48] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - 
10https://gerrit.wikimedia.org/r/1302167 (https://phabricator.wikimedia.org/T429190) [14:11:54] 06SRE, 10Maps, 06Traffic: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191 (10Anton.Kokh) 03NEW [14:12:57] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12019220 (10taavi) I'm not a huge fan of relying on the exact formatting of the interface description,... [14:14:32] 06SRE, 10Maps, 06Traffic: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191#12019226 (10ssingh) Hi @Anton.Kokh: Please see https://wikitech.wikimedia.org/wiki/Maps/External_usage on how to structure this request. Once that is done, this then... [14:14:51] (03CR) 10Pppery: Localisation updates from https://translatewiki.net. (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1302135 (owner: 10L10n-bot) [14:17:45] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [14:19:47] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for Our World in Data gadget updates - https://phabricator.wikimedia.org/T429131#12019266 (10Aklapper) 05Stalled→03Open [14:20:40] (03CR) 10Tiziano Fogli: [C:03+2] sre: Add sre.metamonitoring.downtime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1301397 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [14:21:38] 06SRE, 10Maps, 06Traffic: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191#12019273 (10Anton.Kokh) [14:23:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage [14:24:02] !log elukey@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [14:24:28] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: tesT [14:25:02] 06SRE, 10Maps, 06Traffic: Possibility to allow Wikimedia Maps usage on all Wikibase Cloud instances - https://phabricator.wikimedia.org/T429191#12019296 (10Anton.Kokh) This structure doesn't exactly apply to our case, because we are a wiki farm, not a specific project, but here goes: **Link to site**: all s... [14:26:54] (03PS1) 10Kosta Harlan: NoReferrerLinks: Add rel=noreferrer noopener for configured domains [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302169 (https://phabricator.wikimedia.org/T429090) [14:27:07] elukey@cumin1003 reimage (PID 4119135) is awaiting input [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1430) [14:30:24] (03CR) 10CWilliams: [C:03+1] check_private_data_report: Add Ceri [puppet] - 10https://gerrit.wikimedia.org/r/1302123 (owner: 10Marostegui) [14:30:34] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add Ceri [puppet] - 10https://gerrit.wikimedia.org/r/1302123 (owner: 10Marostegui) [14:30:48] (03CR) 10Marostegui: [C:03+2] Revert "db1262: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1302148 (owner: 10Marostegui) [14:31:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [14:31:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1077.eqiad.wmnet with OS trixie [14:32:00] (03CR) 10CWilliams: [C:03+1] "Looks good enough to begin with to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 
(https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [14:32:53] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [14:33:30] (03CR) 10Federico Ceratto: [C:03+2] "Done" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [14:33:35] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.upgrade: add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [14:34:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:37:34] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:41:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1196.eqiad.wmnet with OS trixie [14:41:43] I'm going to backport a patch for T429090. Any objections? [14:41:43] T429090: Add "noreferrer" to the "rel" attribute for links leading to archive.today or one of its mirrors - https://phabricator.wikimedia.org/T429090 [14:41:46] 10SRE-swift-storage, 06Commons, 06DBA, 10media-backups, and 2 others: old file revisions missing of File:A_Warm_Shade_of_Ivory_-_Henry_Mancini_album_cover.jpg - https://phabricator.wikimedia.org/T428406#12019595 (10Ladsgroup) @Zabe since this is enwiki and we are read new there. 
Maybe it's related to the f... [14:42:34] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12019633 (10Marostegui) @Jclark-ctr I just realised that this host restarted itself around 12 hours ago. There're no HW logs that I can see but it got restarted at around 2AM UTC time, and... [14:45:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:45:17] (03PS1) 10Marostegui: Revert^2 "db1262: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1302175 [14:45:33] Msz2001: I think it's fair [14:45:44] luckily it adds no i18n [14:45:49] jouncebot: nowandnext [14:45:49] For the next 0 hour(s) and 14 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1430) [14:45:49] In 0 hour(s) and 44 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1530) [14:46:23] (03CR) 10Marostegui: [C:03+2] Revert^2 "db1262: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1302175 (owner: 10Marostegui) [14:46:42] Reedy: Yeah, thanks. 
In Slack it was suggested to ask for CTT review first so I'll do that (on Slack) before backporting [14:48:40] (03CR) 10Cathal Mooney: Makefike: don't try to install wheel*.whl (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1302127 (owner: 10Ayounsi) [14:49:39] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:06] (03CR) 10Reedy: Makefike: don't try to install wheel*.whl (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1302127 (owner: 10Ayounsi) [14:50:25] (03PS1) 10Btullis: Deploy the new version of the ceph-csi plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302177 (https://phabricator.wikimedia.org/T428385) [14:52:16] 06SRE, 06ServiceOps new: Build httpbb for Trixie - https://phabricator.wikimedia.org/T427899#12019693 (10MLechvien-WMF) @RLazarus do you target to do it this quarter? [14:52:57] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-esams,cr2-esams IPv6 with reason: bouncing pic0 to reconfigure port speeds [14:54:08] !log enable BGP graceful-shutdown sender on cr2-esams to drain traffic T427056 [14:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:24] (03CR) 10Tiziano Fogli: [C:03+1] "Tested on pontoon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1301356 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [14:55:15] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12019747 (10Jhancock.wm) @elukey sent you the email [14:55:55] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1196: Migration of db1196.eqiad.wmnet completed [14:56:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:57:38] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:57:43] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:58:15] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring: add downtime support [puppet] - 10https://gerrit.wikimedia.org/r/1301356 (https://phabricator.wikimedia.org/T429020) (owner: 10Hnowlan) [14:58:27] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:59:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:59:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:06] (03Abandoned) 10Elukey: sre.hosts.reimage: use ADMIN for redfish when reimaging Supermicro hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1285868 (owner: 10Elukey) [15:00:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[15:00:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-by27-esams and cr2-esams (185.15.59.150) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:00:49] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [15:01:38] (03CR) 10Ladsgroup: [C:03+1] Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [15:01:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:01:54] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool esams [reason: no reason specified, no task ID specified] [15:02:10] (03PS1) 10Gerrit maintenance bot: Add nyn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1302180 (https://phabricator.wikimedia.org/T429189) [15:02:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:02:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool esams [reason: no reason specified, no task ID specified] [15:02:57] !log depool esams due to cr2-esams rpd crash [15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:03:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 
UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:03:54] (03CR) 10Ladsgroup: "Ahmon: Do you want me to merge it now? I can do it" [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:04:17] (03CR) 10Ahmon Dancy: "yes please!" [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:05:07] (03PS3) 10Ahmon Dancy: profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) [15:05:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:05:10] (03CR) 10Ladsgroup: [C:03+2] profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:05:13] (03CR) 10Ladsgroup: [V:03+2 C:03+2] profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:05:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between asw1-by27-esams and cr2-esams (185.15.59.150) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:06:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "done \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [15:09:03] (03PS1) 10Ahmon Dancy: Revert "beta: Add deployment-db15 to db-labs config at weight 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302184 (https://phabricator.wikimedia.org/T429099) [15:09:25] (03PS2) 10Ahmon Dancy: Revert "beta: 
Add deployment-db15 to db-labs config at weight 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302184 (https://phabricator.wikimedia.org/T429099) [15:09:57] jouncebot nowandnext [15:09:57] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [15:09:57] In 0 hour(s) and 20 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1530) [15:10:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302184 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [15:11:37] (03Merged) 10jenkins-bot: Revert "beta: Add deployment-db15 to db-labs config at weight 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302184 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [15:12:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:13:24] (03PS1) 10Hnowlan: prometheus: use dc label in appservers_red reporting rules [puppet] - 10https://gerrit.wikimedia.org/r/1302185 (https://phabricator.wikimedia.org/T249663) [15:13:57] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool esams [reason: no reason specified, no task ID specified] [15:15:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool esams [reason: no reason specified, no task ID specified] [15:16:06] !log repool esams following cr2-esams rpd crash [15:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] (03CR) 10FNegri: [C:03+2] toolsdb: automatically terminate idle transactions [puppet] - 
10https://gerrit.wikimedia.org/r/1298835 (https://phabricator.wikimedia.org/T409857) (owner: 10FNegri) [15:17:48] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for an-redacteddb1001.eqiad.wmnet [15:17:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-redacteddb1001.eqiad.wmnet [15:18:02] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for 11 hosts [15:18:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 11 hosts [15:18:48] (03CR) 10Fabfur: "IIUC the `log-format` for variables accepts only `txn` context, but I may be wrong..." [puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [15:18:59] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1154.eqiad.wmnet [15:19:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1154.eqiad.wmnet [15:19:06] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1155.eqiad.wmnet [15:19:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1155.eqiad.wmnet [15:23:59] (03CR) 10Ahmon Dancy: [C:03+1] "This is currently live in beta cluster. I'd like to get it merged for codesearchability." [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [15:24:42] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:25:47] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12020032 (10Jhancock.wm) a:03Jhancock.wm @Marostegui did a cold reboot and BIOS update. did an idrac update too to get better data later. Looks like it resolved the issue, but if it happens again, let us know. t... 
[15:25:52] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:26:25] (03PS1) 10Reedy: Bump guzzlehttp/psr to version 2.11.0 [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302186 (https://phabricator.wikimedia.org/T429208) [15:26:36] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:27:47] (03CR) 10Elukey: [C:03+1] "Left a note but it seems ok to me!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1302127 (owner: 10Ayounsi) [15:27:49] (03CR) 10Dzahn: [C:03+2] Weekly Phab data for Tech News: Remove extra whitespace from table output [puppet] - 10https://gerrit.wikimedia.org/r/1301628 (https://phabricator.wikimedia.org/T428290) (owner: 10Neriah) [15:29:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1530). 
[15:30:52] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:31:38] (03PS1) 10Cathal Mooney: cr2-esams: change ospf interface for peering to cr1-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1302191 (https://phabricator.wikimedia.org/T428199) [15:32:32] (03Abandoned) 10CDobbins: trying out `alias` to get rid of redundancy [puppet] - 10https://gerrit.wikimedia.org/r/1297769 (owner: 10CDobbins) [15:32:46] jouncebot: nowandnext [15:32:46] For the next 0 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1530) [15:32:46] In 1 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700) [15:32:46] In 1 hour(s) and 27 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700) [15:32:49] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12020088 (10Jhancock.wm) i did find this from when we first got the server. same error and i think it was the same process that preceded it. 
T381549 [15:33:17] (03PS1) 10Dreamy Jazz: SourceEditorOverlayHookPayload: Allow aborting of the save [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) [15:33:27] elukey@cumin1003 provision (PID 4179717) is awaiting input [15:33:31] (03PS1) 10Dreamy Jazz: hCaptcha MobileFrontend: Avoid indefinite save loop on known errors [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302194 (https://phabricator.wikimedia.org/T428287) [15:34:00] (03PS1) 10Reedy: OATHUserRepository: Specify caller in query [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302195 [15:34:06] Any objection to me using scap? [15:34:48] fancy putting a few more patches through? :P [15:34:56] Sure [15:35:00] (03CR) 10Cathal Mooney: [C:03+2] cr2-esams: change ospf interface for peering to cr1-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1302191 (https://phabricator.wikimedia.org/T428199) (owner: 10Cathal Mooney) [15:35:16] (03PS2) 10Dreamy Jazz: hCaptcha MobileFrontend: Avoid indefinite save loop on known errors [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302194 (https://phabricator.wikimedia.org/T428287) [15:35:29] Is it just https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/1302195? 
[15:35:36] (03PS1) 10Dzahn: langlist: add 'nyn' - Nyankole language [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) [15:35:37] the vendor patch too [15:36:32] (03Merged) 10jenkins-bot: cr2-esams: change ospf interface for peering to cr1-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1302191 (https://phabricator.wikimedia.org/T428199) (owner: 10Cathal Mooney) [15:36:48] (03CR) 10Dzahn: [C:03+2] langlist: add 'nyn' - Nyankole language [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [15:36:53] !log dzahn@dns1006 START - running authdns-update [15:36:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:36:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302194 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:36:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302195 (owner: 10Reedy) [15:36:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302186 (https://phabricator.wikimedia.org/T429208) (owner: 10Reedy) [15:37:04] Doing both then :D [15:37:10] (03CR) 10Ladsgroup: "We automatically create dns patches https://gerrit.wikimedia.org/r/c/operations/dns/+/1302180/" [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [15:37:11] ty [15:38:05] (03CR) 10Ladsgroup: "post-merge -1. The automatic patch was created before this patch." 
[dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [15:38:31] (03CR) 10Dzahn: [C:03+2] "ok :( too late for this but I will stop doing this in the future. It is a bit sad though." [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [15:40:11] !log dzahn@dns1006 END - running authdns-update [15:41:04] !log added new project language 'nyn' - Bantu language spoken by the Nkore and Hema peoples of Southwestern Uganda [15:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1196: Migration of db1196.eqiad.wmnet completed [15:41:26] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:43:43] (03CR) 10CI reject: [V:04-1] SourceEditorOverlayHookPayload: Allow aborting of the save [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:44:27] (03CR) 10Dzahn: [C:03+2] releases: mask tmp.mount [puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [15:44:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:44:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302194 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:44:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302195 (owner: 
10Reedy) [15:44:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302186 (https://phabricator.wikimedia.org/T429208) (owner: 10Reedy) [15:45:52] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:48:13] (03CR) 10Anzx: [C:04-1] "duplicate of Ie4bf7d8368c7a56748" [dns] - 10https://gerrit.wikimedia.org/r/1302180 (https://phabricator.wikimedia.org/T429189) (owner: 10Gerrit maintenance bot) [15:49:08] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:49:27] (03CR) 10Dzahn: [C:03+2] "arrr.. once again something that should be simple is not." 
[puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [15:49:44] (03Merged) 10jenkins-bot: hCaptcha MobileFrontend: Avoid indefinite save loop on known errors [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302194 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [15:49:47] (03Merged) 10jenkins-bot: OATHUserRepository: Specify caller in query [extensions/OATHAuth] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302195 (owner: 10Reedy) [15:50:07] (03Merged) 10jenkins-bot: Bump guzzlehttp/psr to version 2.11.0 [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302186 (https://phabricator.wikimedia.org/T429208) (owner: 10Reedy) [15:50:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:50:24] (03CR) 10Gergő Tisza: "It was already not testing all of the possible URLs before (authenticate was missing + only one variant of each endpoint was tested), and " [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [15:50:47] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: puppet debugging [15:51:14] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: puppet debugging [15:51:25] (03Abandoned) 10Dzahn: Add nyn to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1302180 (https://phabricator.wikimedia.org/T429189) (owner: 10Gerrit maintenance bot) [15:51:34] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:55:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12020198 
(10elukey) @Jclark-ctr all provisioned! I reimaged cloudvirt1077 and it is ready now, but cloudvirt1078 seems missing basic network setup on Netbox. Anything missing on that side? [15:56:35] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12020216 (10Marostegui) >>! In T429113#12020088, @Jhancock.wm wrote: > i did find this from when we first got the server. same error and i think it was the same or similar process that preceded it. T381549 would y... [15:57:39] (03CR) 10C. Scott Ananian: [C:03+1] "Should work." [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302169 (https://phabricator.wikimedia.org/T429090) (owner: 10Kosta Harlan) [15:57:45] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:58:03] (03CR) 10FNegri: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [15:58:14] jouncebot: nowandnext [15:58:14] For the next 0 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1530) [15:58:14] In 1 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700) [15:58:15] In 1 hour(s) and 1 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700) [15:58:29] I'm currently waiting for backport merges [15:58:36] I could stop and include that other one? [15:58:41] Dreamy_Jazz: can we include https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCustomizations/+/1302169 as well? 
[15:59:24] Otherwise, I’ll do it when you’re done [15:59:24] (03Abandoned) 10FNegri: sre.mysql.upgrade: fix looping logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [15:59:52] I have a PrivateSettings patch to sync after that anyway [16:00:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [16:00:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302169 (https://phabricator.wikimedia.org/T429090) (owner: 10Kosta Harlan) [16:03:27] (03Merged) 10jenkins-bot: SourceEditorOverlayHookPayload: Allow aborting of the save [extensions/MobileFrontend] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302192 (https://phabricator.wikimedia.org/T428287) (owner: 10Dreamy Jazz) [16:04:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:04:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:04:58] (03CR) 10Atsuko: [C:03+2] toolhub: switch prod to prod opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [16:05:12] (03Merged) 10jenkins-bot: NoReferrerLinks: Add rel=noreferrer noopener for configured domains [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302169 (https://phabricator.wikimedia.org/T429090) (owner: 10Kosta Harlan) [16:05:37] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1302192|SourceEditorOverlayHookPayload: 
Allow aborting of the save (T428287)]], [[gerrit:1302194|hCaptcha MobileFrontend: Avoid indefinite save loop on known errors (T428287)]], [[gerrit:1302195|OATHUserRepository: Specify caller in query]], [[gerrit:1302186|Bump guzzlehttp/psr to version 2.11.0 (T429208)]], [[gerrit:1302169|NoReferrerLinks: Add rel [16:05:37] =noreferrer noopener for configured domains (T429090)]] [16:05:42] T428287: hCaptcha MobileFrontend: Indefinite loop of save requests if request always fails with "known" error - https://phabricator.wikimedia.org/T428287 [16:05:43] T429208: guzzlehttp/psr7 security advisories < 2.10.2 - https://phabricator.wikimedia.org/T429208 [16:05:43] T429090: Add "noreferrer" to the "rel" attribute for links leading to archive.today or one of its mirrors - https://phabricator.wikimedia.org/T429090 [16:06:15] (03PS1) 10MSantos: Disable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 [16:07:11] (03Merged) 10jenkins-bot: toolhub: switch prod to prod opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1301358 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [16:07:24] !log dreamyjazz@deploy1003 reedy, dreamyjazz, kharlan: Backport for [[gerrit:1302192|SourceEditorOverlayHookPayload: Allow aborting of the save (T428287)]], [[gerrit:1302194|hCaptcha MobileFrontend: Avoid indefinite save loop on known errors (T428287)]], [[gerrit:1302195|OATHUserRepository: Specify caller in query]], [[gerrit:1302186|Bump guzzlehttp/psr to version 2.11.0 (T429208)]], [[gerrit:1302169|NoReferrerLinks: Add [16:07:24] rel=noreferrer noopener for configured domains (T429090)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:07:34] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:07:40] reedy: Anything to test for yours? [16:07:48] not for me :) [16:08:38] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply [16:08:39] (03PS1) 10Ahmon Dancy: beta: Replace deployment-db14 with deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) [16:08:52] !log dreamyjazz@deploy1003 reedy, dreamyjazz, kharlan: Continuing with deployment [16:08:58] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [16:10:27] (03CR) 10Jdlrobson: [C:03+1] Disable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302201 (owner: 10MSantos) [16:10:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:13:04] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:13:14] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302192|SourceEditorOverlayHookPayload: Allow aborting of the save (T428287)]], [[gerrit:1302194|hCaptcha MobileFrontend: Avoid indefinite save loop on known errors (T428287)]], [[gerrit:1302195|OATHUserRepository: Specify caller in query]], [[gerrit:1302186|Bump guzzlehttp/psr to version 2.11.0 (T429208)]], [[gerrit:1302169|NoReferrerLinks: Add re [16:13:14] l=noreferrer noopener for configured domains (T429090)]] (duration: 07m 37s) [16:13:19] T428287: hCaptcha MobileFrontend: Indefinite loop of save requests if request always fails with "known" error - https://phabricator.wikimedia.org/T428287 [16:13:20] T429208: guzzlehttp/psr7 security advisories < 2.10.2 - 
https://phabricator.wikimedia.org/T429208 [16:13:20] T429090: Add "noreferrer" to the "rel" attribute for links leading to archive.today or one of its mirrors - https://phabricator.wikimedia.org/T429090 [16:13:29] kostajh: Your turn [16:13:53] kostajh: Would you mind adding 1302202 to your deployment? [16:13:54] (I'll enable MF for group2 later today once I'm sure the patch I backported is stable) [16:14:11] Kosta is doing private code deploy [16:14:32] Though I guess making the private code change and then running scap on that could work? [16:14:45] dancy: you can go ahead with your sync [16:14:50] I’m in a meeting and can’t focus on the private change now [16:14:53] thanks! It will be brief.. beta-only [16:15:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:16:05] !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply [16:16:22] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [16:16:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:16:46] !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [16:17:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12020376 (10elukey) @colewhite hi! I think that the partman recipe is not working for kafka-logging1006, not sure what is the problem, but the debian installer complains ab... 
[16:17:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:20:58] (03CR) 10Slyngshede: "q" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [16:23:26] (03CR) 10CI reject: [V:04-1] beta: Replace deployment-db14 with deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:24:41] (03CR) 10Ahmon Dancy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:24:49] (03CR) 10Muehlenhoff: [C:03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/1302126 (owner: 10Slyngshede) [16:27:11] (03PS1) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302205 (https://phabricator.wikimedia.org/T418299) [16:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 12.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:28:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:29:00] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:29:41] (03CR) 10Ahmon Dancy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:29:51] (03CR) 10CI reject: [V:04-1] systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302205 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [16:29:57] FIRING: [4x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:31:18] (03Merged) 10jenkins-bot: beta: Replace deployment-db14 with deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302202 (https://phabricator.wikimedia.org/T429099) (owner: 10Ahmon Dancy) [16:31:59] (03PS1) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) [16:32:22] (03PS2) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302205 (https://phabricator.wikimedia.org/T418299) [16:33:06] (03PS3) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302205 (https://phabricator.wikimedia.org/T418299) [16:33:15] 
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:34:25] (03CR) 10Ilias Sarantopoulos: [C:03+1] "just closing the unresolved comment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [16:34:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:57] RESOLVED: [4x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:02] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [16:36:13] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [16:36:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.107s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:38:53] (03PS2) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) [16:40:50] (03PS3) 10Dzahn: systemd: 
add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) [16:41:11] (03CR) 10CWilliams: "@fceratto@wikimedia.org is there a reason that spicerack built-in functionality is not being used here, e.g. https://gerrit.wikimedia.org/" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [16:42:10] (03Abandoned) 10Dzahn: systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302205 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [16:42:45] (03CR) 10CWilliams: "I see new code doing the work of existing code, e.g. https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1277076/14/cookbooks/sre/mysq" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [16:46:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298875 (https://phabricator.wikimedia.org/T423148) (owner: 10Kimberly Sarabia) [16:46:35] (03CR) 10Dzahn: [C:03+2] "need https://gerrit.wikimedia.org/r/c/operations/puppet/+/1302206 or similar" [puppet] - 10https://gerrit.wikimedia.org/r/1301400 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700) [17:00:05] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T1700). [17:00:51] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12020531 (10Jhancock.wm) i dug a little more into the errors. 
it's more likely a firmware issue than anything else. The error doesn't usually indicate that the CPU has an error, but that it caught an error in a pro... [17:03:47] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [17:07:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS trixie [17:08:35] (03CR) 10Scott French: "Thanks, Fabrizio!" [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [17:08:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:09:47] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host cp5026 [17:11:05] (03CR) 10Dzahn: "I would like to go ahead with this and amend later if there is something to amend because otherwise I need to revert the other change and " [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [17:12:50] brett@cumin2002 reimage (PID 3302130) is awaiting input [17:13:40] (03PS1) 10BCornwall: common: Update cp5026's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302209 (https://phabricator.wikimedia.org/T428229) [17:13:43] (03CR) 10Scott French: [C:04-1] "Thanks, Fabrizio!"
[puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [17:20:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-c4-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:31:54] (03PS3) 10Dzahn: gerrit: add logrotate for httpd with custom log location [puppet] - 10https://gerrit.wikimedia.org/r/1301462 (https://phabricator.wikimedia.org/T425667) [17:34:27] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c4-eqiad [17:34:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c4-eqiad [17:35:59] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c4-eqiad [17:35:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c4-eqiad [17:39:26] (03CR) 10Dzahn: "the content of the .erb file is like a copy of the default one we use for apache logs, just a different path" [puppet] - 10https://gerrit.wikimedia.org/r/1301462 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [17:40:22] FIRING: [2x] CertAlmostExpired: gNMI TLS certificate for lsw1-c4-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:40:26] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device ssw1-d8-eqiad [17:40:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d8-eqiad [17:42:02] (03PS1) 10Jforrester: [WIP] Periodic jobs: Add 
abstractwiki_update_generated_articles [puppet] - 10https://gerrit.wikimedia.org/r/1302213 (https://phabricator.wikimedia.org/T422628) [17:42:41] (03CR) 10CDobbins: [C:03+1] common: Update cp5026's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302209 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [17:42:51] (03CR) 10BCornwall: [C:03+2] common: Update cp5026's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302209 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [17:45:22] FIRING: [3x] CertAlmostExpired: gNMI TLS certificate for lsw1-c4-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:46:34] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:47:46] (03CR) 10Jforrester: "check puppet" [puppet] - 10https://gerrit.wikimedia.org/r/1302213 (https://phabricator.wikimedia.org/T422628) (owner: 10Jforrester) [17:50:22] FIRING: [8x] CertAlmostExpired: gNMI TLS certificate for lsw1-c3-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:51:15] (03CR) 10Scott French: [C:03+1] k8s: add wikikube-worker2331 [puppet] - 10https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: 10Jasmine) [17:52:15] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5026 - brett@cumin2002" [17:52:25] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host cp5026 - brett@cumin2002" [17:52:25] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:26] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp5026.eqsin.wmnet 37.0.132.10.in-addr.arpa 7.3.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [17:52:29] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5026.eqsin.wmnet 37.0.132.10.in-addr.arpa 7.3.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [17:52:30] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5026 [17:53:19] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5026 [17:53:19] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5026 [17:55:22] FIRING: [13x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:57:03] (03CR) 10BCornwall: "I'd argue that using hiera at all is just overcomplicating things - instead of changing it in some abstraction, just change it in the VCL." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [17:57:26] (03CR) 10BCornwall: [C:03+1] Depool puppetserver2002 for rack maintenance [dns] - 10https://gerrit.wikimedia.org/r/1300766 (https://phabricator.wikimedia.org/T428020) (owner: 10Muehlenhoff) [17:57:43] (03CR) 10BCornwall: [C:03+1] Add new control plane wikikube-ctrl1005 to etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1300942 (https://phabricator.wikimedia.org/T418920) (owner: 10Jasmine) [17:58:33] 06SRE, 06Infrastructure-Foundations, 10netops: Network device tls certs: alerting niggles - https://phabricator.wikimedia.org/T429242 (10cmooney) 03NEW p:05Triage→03Medium [18:00:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-d6-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:01:14] ^^^ folks I'll look at these cert renewals tomorrow. current ones are good until end of July, but renewal cookbook only works if they have less than 4 weeks left. 
[18:01:23] T429242 [18:01:23] T429242: Network device tls certs: alerting niggles - https://phabricator.wikimedia.org/T429242 [18:02:22] (03CR) 10BCornwall: [C:03+1] systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [18:08:00] (03CR) 10Dzahn: [C:03+2] systemd: add a new data type for a systemd mount [puppet] - 10https://gerrit.wikimedia.org/r/1302206 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [18:10:07] (03CR) 10Dzahn: [C:03+2] "what" [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [18:11:42] (03PS1) 10Ahmon Dancy: beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) [18:13:11] (03PS2) 10Ahmon Dancy: beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) [18:15:18] (03CR) 10CI reject: [V:04-1] beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [18:16:45] (03PS3) 10Ahmon Dancy: beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) [18:18:05] !log releases2003 - systemctl stop tmp.mount [18:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:26] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [18:27:27] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5005.eqsin.wmnet} and A:liberica [18:27:52] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5005.eqsin.wmnet} and A:liberica [18:34:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet 
is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:35:00] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [18:37:03] (03CR) 10Dzahn: "sorry, I am not comfortable merging since I have no relation to this or the context" [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [18:38:04] (03CR) 10Ahmon Dancy: "No problem. I will ask Amir." [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [18:39:11] (03CR) 10Ahmon Dancy: "Amir, these are the scripts that I used to provision deployment-db15 and will be using to provision deployment-db16 shortly. I'll be upda" [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [18:39:20] (03CR) 10Dzahn: "going ahead - already tested this by manually adding the same config and running logrotate directly once" [puppet] - 10https://gerrit.wikimedia.org/r/1301462 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [18:39:59] (03CR) 10Ssingh: "No, you are right. Let's go ahead -- let me know when we should merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [18:40:07] (03PS1) 10BCornwall: Create sre.cdn.roll-restart-purged [cookbooks] - 10https://gerrit.wikimedia.org/r/1302230 [18:40:28] (03CR) 10Dzahn: [C:03+2] gerrit: add logrotate for httpd with custom log location [puppet] - 10https://gerrit.wikimedia.org/r/1301462 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [18:42:16] (03PS1) 10Medelius: Enable "exit the editor" survey on 11 wikis for phase 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302232 (https://phabricator.wikimedia.org/T426132) [18:42:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-purged rolling restart_daemons on P{cp7001.magru.wmnet} and A:cp [18:43:46] (03CR) 10Ssingh: "This is mostly for @cdobbins@wikimedia.org to decide but I wanted to separate the CSP since other teams (PSI) may also need to edit it and" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [18:44:02] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-purged (exit_code=0) rolling restart_daemons on P{cp7001.magru.wmnet} and A:cp [18:45:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:45:32] 06SRE, 06Infrastructure-Foundations, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12020934 (10ssingh) [Adding @cmooney and @ayounsi for the anycast bits]. 
[18:45:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302232 (https://phabricator.wikimedia.org/T426132) (owner: 10Medelius) [18:51:38] (03CR) 10JHathaway: [C:03+1] Blocklisting more unused packet mangling/network scheduler modules [puppet] - 10https://gerrit.wikimedia.org/r/1301368 (owner: 10Muehlenhoff) [18:52:34] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:56:08] (03PS1) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [18:56:31] (03PS1) 10Dzahn: gerrit: fix syntax for logrotate template content [puppet] - 10https://gerrit.wikimedia.org/r/1302238 (https://phabricator.wikimedia.org/T425667) [18:56:42] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [18:56:56] (03CR) 10Dzahn: [C:03+2] gerrit: fix syntax for logrotate template content [puppet] - 10https://gerrit.wikimedia.org/r/1302238 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [18:57:39] (03PS2) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [18:58:12] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [19:01:19] (03PS3) 10Andrew 
Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [19:01:58] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [19:04:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS trixie [19:04:49] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5005.eqsin.wmnet} and A:liberica [19:05:02] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5005.eqsin.wmnet} and A:liberica [19:05:23] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.* [19:06:51] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5026.* [19:14:55] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5004.eqsin.wmnet} and A:liberica [19:15:19] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5004.eqsin.wmnet} and A:liberica [19:16:53] !log brett@cumin2002 START - Cookbook sre.loadbalancer.upgrade restart P{lvs5005.eqsin.wmnet} and A:liberica [19:17:52] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart P{lvs5005.eqsin.wmnet} and A:liberica [19:18:19] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.* [19:21:08] !log brett@cumin2002 START - Cookbook sre.loadbalancer.upgrade restart A:liberica-eqsin [19:23:55] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart A:liberica-eqsin [19:25:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie [19:26:25] !log brett@cumin2002 START - Cookbook 
sre.hosts.move-vlan for host cp5028 [19:27:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:49] (03PS1) 10BCornwall: common: Update cp5028's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302249 (https://phabricator.wikimedia.org/T428229) [19:29:28] brett@cumin2002 reimage (PID 3331383) is awaiting input [19:31:57] (03CR) 10Ssingh: [C:03+1] common: Update cp5028's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302249 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [19:33:06] (03PS4) 10Ahmon Dancy: beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) [19:33:06] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [19:33:10] (03CR) 10Ladsgroup: [V:03+2 C:03+2] beta: Add replica provisioning scripts [puppet] - 10https://gerrit.wikimedia.org/r/1302224 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [19:33:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:33:52] (03CR) 10BCornwall: [C:03+2] common: Update cp5028's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302249 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [19:33:57] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [19:33:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.* [19:34:28] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3067.esams.wmnet [19:35:01] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3067.esams.wmnet [19:36:20] !log 
brett@cumin2002 START - Cookbook sre.dns.netbox [19:39:40] (03CR) 10Esanders: [C:03+1] Enable "exit the editor" survey on 11 wikis for phase 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302232 (https://phabricator.wikimedia.org/T426132) (owner: 10Medelius) [19:40:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-wdqs2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:55] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5028 - brett@cumin2002" [19:43:01] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5028 - brett@cumin2002" [19:43:01] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:01] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache cp5028.eqsin.wmnet 25.0.132.10.in-addr.arpa 5.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:43:05] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5028.eqsin.wmnet 25.0.132.10.in-addr.arpa 5.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:43:06] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5028 [19:44:02] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5028 [19:44:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5028 [19:44:17] 06SRE, 06ServiceOps new: Build httpbb for Trixie - https://phabricator.wikimedia.org/T427899#12021070 (10RLazarus) Yes, planning it for this week. 
[19:44:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS trixie [19:44:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12021074 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhanco... [19:46:58] FIRING: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:48:53] (03PS4) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [19:51:58] RESOLVED: ProbeDown: Service upload:80 has failed probes (http_upload_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [19:55:33] jhancock@cumin2002 reimage (PID 3336215) is awaiting input [19:57:23] (03PS1) 10Ahmon Dancy: beta: Fix modules/beta/files/receive_replica.sh [puppet] - 10https://gerrit.wikimedia.org/r/1302253 (https://phabricator.wikimedia.org/T428930) [19:58:27] (03CR) 10Ahmon Dancy: "Followup to Ib96b0b3ddfe639b717b52e149e07ecc9d6a7f1af" [puppet] - 10https://gerrit.wikimedia.org/r/1302253 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [19:59:03] (03PS5) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 
10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [19:59:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [19:59:25] (03CR) 10Ladsgroup: [C:03+2] beta: Fix modules/beta/files/receive_replica.sh [puppet] - 10https://gerrit.wikimedia.org/r/1302253 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T2000). [20:00:05] bpirkle and cmede: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] I'm here [20:00:26] o/ me too [20:02:45] (03CR) 10Ladsgroup: "We have been doing this automatically for almost six years now: https://gitlab.wikimedia.org/ladsgroup/Phabricator-maintenance-bot/-/commi" [dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [20:02:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2001.codfw.wmnet with OS trixie [20:03:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12021195 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@c... [20:06:00] would anyone be able to help me deploy mine? bpirkle, will you be deploying yours yourself? [20:06:29] I'd like help too. I checked earlier and realized I never requested spiderpig permissions. [20:07:25] I can help with deployments today. 
[20:07:41] cmede, bpirkle: Can your changes go out together? [20:07:41] wahoo thank you [20:07:53] i'm fine with it if they are [20:07:58] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata-users" for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T428416#12021218 (10BCornwall) @karapayneWMDE We're still waiting on your sign-off! [20:07:58] I'm good with that. Mine is a no-op in preparation for future core changes [20:08:04] ok.. pressing the button [20:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) (owner: 10BPirkle) [20:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302232 (https://phabricator.wikimedia.org/T426132) (owner: 10Medelius) [20:09:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12021222 (10BCornwall) Hi, @EChukwukere-WMF, we're still waiting on your reply for your needs. After you've determined that we can go about moving forward. Thanks! 
[20:09:46] (03Merged) 10jenkins-bot: REST: set new RestModuleOverrides variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) (owner: 10BPirkle) [20:09:50] (03Merged) 10jenkins-bot: Enable "exit the editor" survey on 11 wikis for phase 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302232 (https://phabricator.wikimedia.org/T426132) (owner: 10Medelius) [20:10:05] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1300245|REST: set new RestModuleOverrides variable (T422756)]], [[gerrit:1302232|Enable "exit the editor" survey on 11 wikis for phase 2 (T426132)]] [20:10:12] T422756: REST: Audience Designations - add RestModuleOverrides config value - https://phabricator.wikimedia.org/T422756 [20:10:13] T426132: Deploy config change to start "Exit the editor" survey (Phase 2) - https://phabricator.wikimedia.org/T426132 [20:12:26] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26): Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12021240 (10BCornwall) p:05Triage→03Medium [20:14:09] !log dancy@deploy1003 caro, dancy, bpirkle: Backport for [[gerrit:1300245|REST: set new RestModuleOverrides variable (T422756)]], [[gerrit:1302232|Enable "exit the editor" survey on 11 wikis for phase 2 (T426132)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:14:18] checking [20:14:26] checking [20:15:43] Looks good [20:16:29] same for me! [20:16:33] (03PS35) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:16:40] Ok. 
Proceeding [20:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:44] !log dancy@deploy1003 caro, dancy, bpirkle: Continuing with deployment [20:17:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [20:18:16] (03PS1) 10Ahmon Dancy: beta: Add deployment-db16 as a second read replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302257 (https://phabricator.wikimedia.org/T429245) [20:19:01] (03PS6) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:19:37] (03CR) 10CI reject: [V:04-1] cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:21:00] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300245|REST: set new RestModuleOverrides variable (T422756)]], [[gerrit:1302232|Enable "exit the editor" survey on 11 wikis for phase 2 (T426132)]] (duration: 10m 54s) [20:21:05] T422756: REST: Audience Designations - add RestModuleOverrides config value - https://phabricator.wikimedia.org/T422756 [20:21:06] T426132: Deploy config change to start "Exit the editor" survey (Phase 2) - https://phabricator.wikimedia.org/T426132 [20:21:45] (03PS7) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:23:21] sweet, thank you dancy [20:23:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1302257 (https://phabricator.wikimedia.org/T429245) (owner: 10Ahmon Dancy) [20:23:28] cmede: yw [20:23:31] Yep, thank you! [20:24:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:24:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [20:24:44] (03Merged) 10jenkins-bot: beta: Add deployment-db16 as a second read replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302257 (https://phabricator.wikimedia.org/T429245) (owner: 10Ahmon Dancy) [20:26:46] FIRING: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [20:28:25] (03CR) 10CDobbins: "upload: https://puppet-compiler.wmflabs.org/output/1297217/8736/" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:28:59] (03PS8) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:29:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:29:33] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:30:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:31:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:34:33] FIRING: [16x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:35:50] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:38:43] (03PS9) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:39:33] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:40:01] FIRING: [8x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:42:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:44:33] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:44:48] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:45:01] FIRING: [16x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:46:16] (03PS10) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:46:26] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:48:42] (03PS1) 10JHathaway: puppet-merge: disable colors if we don't have a tty [puppet] - 10https://gerrit.wikimedia.org/r/1302262 (https://phabricator.wikimedia.org/T429129) [20:49:33] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:50:01] FIRING: [36x] 
NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:50:02] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Terminal configuration for cookbooks - https://phabricator.wikimedia.org/T429129#12021336 (10jhathaway) Since we don't have a PTY, I would disable colors. Patch sent! [20:50:50] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:51:25] (03PS11) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [20:51:30] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [20:51:46] RESOLVED: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [20:52:20] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:52:25] (03PS1) 10Ahmon Dancy: beta: Promote deployment-db15 to master, drop deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302263 (https://phabricator.wikimedia.org/T428910) [20:52:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5028.eqsin.wmnet with OS trixie [20:54:33] FIRING: [17x] KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet 
is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:55:01] FIRING: [36x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [20:57:20] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:59:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302263 (https://phabricator.wikimedia.org/T428910) (owner: 10Ahmon Dancy) [21:00:05] alexsanford, Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T2100). 
[21:00:34] (03Merged) 10jenkins-bot: beta: Promote deployment-db15 to master, drop deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302263 (https://phabricator.wikimedia.org/T428910) (owner: 10Ahmon Dancy) [21:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:10] !log brett@cumin2002 START - Cookbook sre.loadbalancer.upgrade restart P{lvs5005.eqsin.wmnet} and A:liberica [21:06:09] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart P{lvs5005.eqsin.wmnet} and A:liberica [21:06:28] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5028.* [21:06:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12021436 (10EChukwukere-WMF) @RLazarus and @BCornwall, I believe I will not be needing the SSH access or Kerberos. I only need to be added to the analytics-privatedata-... [21:07:22] Hey all - we have a few security deployments to do today. 
Let me know if you’re still deploying :) [21:08:52] (03PS4) 10Daniel Kinzler: smokepy tests: share helm-test pod via vendor module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298850 (https://phabricator.wikimedia.org/T424825) [21:09:47] (03PS1) 10SBassett: ForceReauth: Avoid unnecessary securitySensitiveOperationStatus checks [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302267 [21:10:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302267 (owner: 10SBassett) [21:12:02] (03PS1) 10BCornwall: admin: Actually add bliviero to a-d-u [puppet] - 10https://gerrit.wikimedia.org/r/1302268 (https://phabricator.wikimedia.org/T428815) [21:12:52] (03PS1) 10BCornwall: admin: Add echukwukere to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/1302270 (https://phabricator.wikimedia.org/T428827) [21:13:35] (03Merged) 10jenkins-bot: ForceReauth: Avoid unnecessary securitySensitiveOperationStatus checks [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302267 (owner: 10SBassett) [21:13:55] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1302267|ForceReauth: Avoid unnecessary securitySensitiveOperationStatus checks]] [21:15:39] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1302267|ForceReauth: Avoid unnecessary securitySensitiveOperationStatus checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:15:44] (03PS12) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [21:17:50] !log sbassett@deploy1003 sbassett: Continuing with deployment [21:18:55] (03CR) 10RLazarus: [C:03+1] admin: Actually add bliviero to a-d-u [puppet] - 10https://gerrit.wikimedia.org/r/1302268 (https://phabricator.wikimedia.org/T428815) (owner: 10BCornwall) [21:20:08] (03PS1) 10BCornwall: admin: Add mfossati to ml-lab-users [puppet] - 10https://gerrit.wikimedia.org/r/1302272 (https://phabricator.wikimedia.org/T429148) [21:20:29] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26), 13Patch-For-Review: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12021466 (10BCornwall) 05Open→03In progress [21:21:01] (03CR) 10BCornwall: [C:03+2] admin: Actually add bliviero to a-d-u [puppet] - 10https://gerrit.wikimedia.org/r/1302268 (https://phabricator.wikimedia.org/T428815) (owner: 10BCornwall) [21:22:05] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302267|ForceReauth: Avoid unnecessary securitySensitiveOperationStatus checks]] (duration: 08m 11s) [21:22:22] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12021470 (10BCornwall) I apologize, but there was one important bit that I forgot to do - that's been fixed and your access sho... 
[21:24:24] (03PS13) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [21:25:07] (03PS1) 10Krinkle: Disable ShortUrl on hiwiki, hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302274 (https://phabricator.wikimedia.org/T107188) [21:28:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12021477 (10BLiviero-WMF) thank you! i was still having trouble getting to some data cubes but was afraid to ask, glad it is g... [21:31:28] (03PS2) 10Krinkle: Disable ShortUrl on hiwiki, hiwikiversity, knwiki, knwikisource, tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302274 (https://phabricator.wikimedia.org/T107188) [21:39:29] (03PS1) 10Ahmon Dancy: beta: Point remaining db11 references at deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302277 (https://phabricator.wikimedia.org/T428930) [21:40:25] !log Deployed security fix for T428820 [21:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:50] (03CR) 10Ahmon Dancy: [C:03+2] beta: Point remaining db11 references at deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302277 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [21:41:04] sbassett: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1302277 many be included in your deployment. It's a no-op for production. [21:41:09] *may be [21:41:43] (03Merged) 10jenkins-bot: beta: Point remaining db11 references at deployment-db15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302277 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [21:42:33] dancy: ok, no problem [21:43:17] I’m doing sync-files for the remaining security patches though. 
I did a sync-world for the first one via spiderpig though. [21:43:26] OK THanks [21:48:29] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1302277|beta: Point remaining db11 references at deployment-db15 (T428930)]] [21:48:33] T428930: Set up deployment-db15 with Trixie and wmf-mariadb1011 - https://phabricator.wikimedia.org/T428930 [21:48:58] !log Deployed security fix for T428809 [21:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:23] !log dancy@deploy1003 dancy: Backport for [[gerrit:1302277|beta: Point remaining db11 references at deployment-db15 (T428930)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:51:51] dancy: Er, I have two more patches to go... [21:53:22] !log dancy@deploy1003 dancy: Continuing with deployment [21:53:50] sbassett: Sorry about that. I'll be out of the way in a couple of minutes [21:54:55] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302277|beta: Point remaining db11 references at deployment-db15 (T428930)]] (duration: 12m 27s) [21:55:00] T428930: Set up deployment-db15 with Trixie and wmf-mariadb1011 - https://phabricator.wikimedia.org/T428930 [21:55:12] sbassett: I'm done. Apologies again. 
[21:56:06] No prob, thanks [22:03:36] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [22:04:19] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [22:04:21] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [22:04:51] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [22:04:54] (03PS1) 10Bking: cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) [22:05:02] !log Deployed updated security fix for T427611 [22:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:27] (03CR) 10CI reject: [V:04-1] cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [22:08:55] (03PS2) 10Bking: cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) [22:11:09] (03CR) 10JHathaway: [C:03+1] mx-out: Enable profile::auto_restarts::service for Dovecot [puppet] - 10https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [22:35:45] !log Deployed private config for T429244 [22:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:25] Ok, all done with security patches for today. 
[22:44:09] (03CR) 10RLazarus: "Should we also delete the test at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile" [puppet] - 10https://gerrit.wikimedia.org/r/1302106 (https://phabricator.wikimedia.org/T418492) (owner: 10Clément Goubert) [22:52:34] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260615T2300) [23:17:54] (03CR) 10RLazarus: mediawiki: Use utf-8 for text/plain and text/html. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1301338 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [23:27:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:04] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1301846 (owner: 10TrainBranchBot) [23:42:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302284 [23:42:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302284 (owner: 10TrainBranchBot) [23:49:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302284 (owner: 10TrainBranchBot)