[00:00:43] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[00:08:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194326
[00:08:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194326 (owner: TrainBranchBot)
[00:09:41] !log Deployed security mitigation for T406664 to 1.45.0-wmf.22
[00:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:19] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[00:27:38] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[00:29:56] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194326 (owner: TrainBranchBot)
[00:44:53] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:48:52] marostegui@cumin1003 clone_es (PID 1652883) is awaiting input
[01:01:02] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:14:15] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 13s)
[01:30:51] fceratto@cumin1002 clone_es (PID 4029038) is awaiting input
[01:36:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:07] marostegui@cumin1003 clone_es (PID 1656611) is awaiting input
[01:50:48] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage
[01:54:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage
[02:05:15] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[02:09:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[02:09:53] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:10:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[02:12:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1018.eqiad.wmnet with OS bullseye
[02:19:17] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[02:24:50] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:48:50] (PS1) DLynch: Launch VisualEditor EditCheck paste check a/b test to 22 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422)
[03:37:01] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-internal-main host T405978
[03:37:04] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[03:38:38] !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1018.*
[03:41:49] !log ryankemper@cumin2002 conftool action : GET; selector: name=wdqs1018.eqiad.wmnet
[03:47:39] (PS1) Ryan Kemper: wdqs: provision wdqs1018 for wdqs-main [puppet] - https://gerrit.wikimedia.org/r/1194336 (https://phabricator.wikimedia.org/T405978)
[03:49:55] (CR) Ryan Kemper: [C:+2] "Low-touch patch; self merging to get wdqs1018 online" [puppet] - https://gerrit.wikimedia.org/r/1194336 (https://phabricator.wikimedia.org/T405978) (owner: Ryan Kemper)
[03:52:48] !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1018.*
[03:53:12] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-internal-main host T405978 (duration: 16m 11s)
[03:53:15] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[03:53:15] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-internal-main host T405978
[03:53:45] (CR) RLazarus: [C:+1] mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[03:55:17] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-internal-main host T405978 (duration: 02m 01s)
[03:56:28] (CR) RLazarus: [C:+1] "That estimate_mw_replicas script is really nicely done!" [deployment-charts] - https://gerrit.wikimedia.org/r/1194256 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[03:59:28] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:01:46] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:01:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:04:12] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:28] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:06:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54973 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:06:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:09:28] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:12:21] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:14:43] FIRING: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:24:43] RESOLVED: SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:24:49] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:27:17] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:27:23] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:32:23] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:37:19] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host T405978
[04:37:23] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:37:23] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[04:37:33] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host T405978 (duration: 00m 14s)
[04:38:21] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:38:23] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:38:23] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:38:23] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:39:12] FIRING: [4x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:16] ryankemper@cumin2002 reimage (PID 916765) is awaiting input
[04:41:28] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye
[04:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:56:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 22.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 24.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:33:09] ryankemper@cumin2002 reimage (PID 916765) is awaiting input
[05:34:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:37:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1028 gradually with 4 steps - Pool es1028.eqiad.wmnet in after cloning
[05:37:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1026 gradually with 4 steps - Pool es1026.eqiad.wmnet in after cloning
[05:38:24] (PS1) Ryan Kemper: wdqs: move wdqs1018 back to insetup [puppet] - https://gerrit.wikimedia.org/r/1194341
[05:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:30] SRE, SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11253249 (WMDECyn) Approved from WMDE side
[05:52:51] SRE, SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11253250 (WMDECyn) Approved from WMDE side
[05:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 19.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T0600)
[06:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:10:59] (PS1) Marostegui: es1051: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1194343 (https://phabricator.wikimedia.org/T406488)
[06:12:08] (CR) Marostegui: [C:+2] es1051: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1194343 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[06:12:54] !log rebalance Ganeti eqiad/D following vmscape reboots
[06:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:27] (PS1) Marostegui: instances.yaml: Add es1049 and es1051 [puppet] - https://gerrit.wikimedia.org/r/1194344 (https://phabricator.wikimedia.org/T406488)
[06:17:05] (CR) Marostegui: [C:+2] instances.yaml: Add es1049 and es1051 [puppet] - https://gerrit.wikimedia.org/r/1194344 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[06:17:58] (CR) Muehlenhoff: [C:+2] Remove obsolete ganeti_init.sh script [puppet] - https://gerrit.wikimedia.org/r/1189117 (owner: Muehlenhoff)
[06:24:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1049 and es1051 to dbctl depooled T406488', diff saved to https://phabricator.wikimedia.org/P83659 and previous config saved to /var/cache/conftool/dbconfig/20251008-062404-marostegui.json
[06:24:08] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[06:24:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1028 gradually with 4 steps - Pool es1028.eqiad.wmnet in after cloning
[06:24:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1028.eqiad.wmnet onto es1051.eqiad.wmnet
[06:24:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1026 gradually with 4 steps - Pool es1026.eqiad.wmnet in after cloning
[06:24:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1026.eqiad.wmnet onto es1049.eqiad.wmnet
[06:25:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1027,1030].eqiad.wmnet with reason: Cloning
[06:27:43] (PS1) Marostegui: mariadb: Productionize es1050 [puppet] - https://gerrit.wikimedia.org/r/1194354 (https://phabricator.wikimedia.org/T406488)
[06:27:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1027 T406488', diff saved to https://phabricator.wikimedia.org/P83662 and previous config saved to /var/cache/conftool/dbconfig/20251008-062752-marostegui.json
[06:28:41] (CR) Marostegui: [C:+2] mariadb: Productionize es1050 [puppet] - https://gerrit.wikimedia.org/r/1194354 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[06:29:41] !log installing openssl security updates
[06:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1027.eqiad.wmnet onto es1050.eqiad.wmnet
[06:34:03] (CR) Arnaudb: [C:+2] gerrit: mod_qos tweaks [puppet] - https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: Arnaudb)
[06:35:39] marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[06:38:36] (PS1) Arnaudb: Revert "gerrit: mod_qos tweaks" [puppet] - https://gerrit.wikimedia.org/r/1194361
[06:40:19] (Abandoned) Arnaudb: Revert "gerrit: mod_qos tweaks" [puppet] - https://gerrit.wikimedia.org/r/1194361 (owner: Arnaudb)
[06:41:39] (PS1) Arnaudb: gerrit: hotfix mod_qos syntax error [puppet] - https://gerrit.wikimedia.org/r/1194363 (https://phabricator.wikimedia.org/T406403)
[06:44:58] (CR) Arnaudb: [C:+2] "# comments at the end of a config line are not valid" [puppet] - https://gerrit.wikimedia.org/r/1194363 (https://phabricator.wikimedia.org/T406403) (owner: Arnaudb)
[06:50:07] (CR) Jelto: [C:+1] gerrit: hotfix mod_qos syntax error [puppet] - https://gerrit.wikimedia.org/r/1194363 (https://phabricator.wikimedia.org/T406403) (owner: Arnaudb)
[06:53:00] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS trixie
[06:55:13] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2005.wikimedia.org with OS trixie
[06:57:31] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye
[07:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:01:40] (CR) Elukey: osm: refactor swift scripts and make event-template dynamic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[07:05:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:15:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:16:18] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:16:38] (CR) Muehlenhoff: [C:+2] Create /etc/wikimedia in the cloud VPS base class [puppet] - https://gerrit.wikimedia.org/r/1194156 (owner: Muehlenhoff)
[07:16:56] (PS1) Marostegui: mariadb: Productionize es1053 [puppet] - https://gerrit.wikimedia.org/r/1194426 (https://phabricator.wikimedia.org/T406488)
[07:16:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1030 T406488', diff saved to https://phabricator.wikimedia.org/P83663 and previous config saved to /var/cache/conftool/dbconfig/20251008-071656-marostegui.json
[07:17:00] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:17:34] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[07:17:54] (CR) Marostegui: [C:+2] mariadb: Productionize es1053 [puppet] - https://gerrit.wikimedia.org/r/1194426 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[07:21:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1030.eqiad.wmnet onto es1053.eqiad.wmnet
[07:22:52] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS trixie
[07:24:16] jdlrobson: 👋 looks like you accidentally left a backport waiting for confirmation since yesterday. I've canceled it so the train can be rolled out
[07:25:59] (Abandoned) Muehlenhoff: Use wmflib::dir::mkdir_p to create /etc/wikimedia/maps [puppet] - https://gerrit.wikimedia.org/r/1194105 (owner: Muehlenhoff)
[07:26:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:27:56] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[07:29:48] (Abandoned) Muehlenhoff: Apply replica role to maps1012-1014 [puppet] - https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:37:23] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:44:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:46:10] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:47:20] (PS1) Majavah: P:toolforge: Migrate error page static assets to tools-static [puppet] - https://gerrit.wikimedia.org/r/1194532 (https://phabricator.wikimedia.org/T283948)
[07:47:23] (PS1) Majavah: P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948)
[07:47:46] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:48:56] (PS1) Marostegui: Bug: T406488 [puppet] - https://gerrit.wikimedia.org/r/1194535 (https://phabricator.wikimedia.org/T406488)
[07:49:14] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:49:24] (PS2) Marostegui: installserver: Remove es1049 [puppet] - https://gerrit.wikimedia.org/r/1194535 (https://phabricator.wikimedia.org/T406488)
[07:50:28] (PS2) Majavah: P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948)
[07:51:59] (CR) Marostegui: [C:+2] installserver: Remove es1049 [puppet] - https://gerrit.wikimedia.org/r/1194535 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[07:52:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:55:13] (PS1) Marostegui: db2172: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194539 (https://phabricator.wikimedia.org/T406541)
[07:55:48] (CR) Marostegui: [C:+2] db2172: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194539 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[07:56:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2172.codfw.wmnet with reason: Maintenance
[07:56:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2172 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83664 and previous config saved to /var/cache/conftool/dbconfig/20251008-075612-marostegui.json
[07:59:02] (PS3) Majavah: P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948)
[07:59:17] (CR) Joal: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) (owner: Ottomata)
[08:00:02] !log installing libxml2 security updates
[08:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T0800)
[08:00:21] hi, the train will run in a few minutes
[08:01:47] (CR) Marostegui: clone_es.py: clone readonly es* hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1183646 (owner: Federico Ceratto)
[08:02:30] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:03:47] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2005.wikimedia.org with OS trixie
[08:04:39] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1194540 (https://phabricator.wikimedia.org/T405678)
[08:04:42] (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1194540 (https://phabricator.wikimedia.org/T405678) (owner: TrainBranchBot)
[08:04:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83665 and previous config saved to /var/cache/conftool/dbconfig/20251008-080448-root.json
[08:05:53] (Merged) jenkins-bot: group1 to 1.45.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1194540 (https://phabricator.wikimedia.org/T405678) (owner: TrainBranchBot)
[08:09:28] (PS1) Slyngshede: site.pp move idp-test2005 to insetup [puppet] - https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455)
[08:09:48] (PS1) Marostegui: migration1011.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/1194544 (https://phabricator.wikimedia.org/T406008)
[08:10:23] (CR) Marostegui: [C:+2] migration1011.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/1194544 (https://phabricator.wikimedia.org/T406008) (owner: Marostegui)
[08:10:48] (Merged) jenkins-bot: migration1011.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/1194544 (https://phabricator.wikimedia.org/T406008) (owner: Marostegui)
[08:13:40] (CR) Muehlenhoff: site.pp move idp-test2005 to insetup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[08:14:43] (PS2) Slyngshede: site.pp move idp-test2005 to insetup [puppet] - https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455)
[08:14:54] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.22 refs T405678
[08:14:58] T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678
[08:19:36] (PS3) Slyngshede: site.pp move idp-test2005 to insetup [puppet] - https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455)
[08:19:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83666 and previous config saved to /var/cache/conftool/dbconfig/20251008-081953-root.json
[08:21:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:23:41] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[08:31:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:33:40] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:33:57] (PS6) Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565)
[08:34:17] (CR) Elukey: osm: refactor swift scripts and make event-template dynamic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:35:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83667 and previous config saved to /var/cache/conftool/dbconfig/20251008-083459-root.json
[08:38:37] (CR) Jgiannelos: [C:+1] osm: refactor swift scripts and make event-template dynamic [puppet] - https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:44:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:44:29] (PS7) Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565)
[08:44:30] (PS1) Elukey: Enable tiles invalidation on maps-test2001 [puppet] - https://gerrit.wikimedia.org/r/1194553 (https://phabricator.wikimedia.org/T381565)
[08:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:45:45] (PS2) Elukey: Disable tiles invalidation on maps-test2001 [puppet] - https://gerrit.wikimedia.org/r/1194553 (https://phabricator.wikimedia.org/T381565)
[08:50:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83669 and previous config saved to /var/cache/conftool/dbconfig/20251008-085005-root.json
[08:50:42] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11253656 (elukey) All cp hosts (but 2056) have the latest bios+idrac and I've run the complete version of the provision cookbook to apply the whole set of BIO...
[08:50:55] (CR) Jgiannelos: [C:+1] Disable tiles invalidation on maps-test2001 [puppet] - https://gerrit.wikimedia.org/r/1194553 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:52:15] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning
[08:53:25] (CR) Filippo Giunchedi: [C:+1] P:toolforge: Migrate error page static assets to tools-static [puppet] - https://gerrit.wikimedia.org/r/1194532 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[08:54:12] (CR) Majavah: [C:+2] P:toolforge: Migrate error page static assets to tools-static [puppet] - https://gerrit.wikimedia.org/r/1194532 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[08:54:27] (CR) Filippo Giunchedi: [C:+1] P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948) (owner: Majavah)
[08:55:10] (PS4) Majavah: P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948)
[08:57:20] (PS3) Federico Ceratto: major-upgrade.py: MariaDB version upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469)
[08:57:33]
(03CR) 10Majavah: [C:03+2] P:toolforge: Migrate default robots and favicon handlers to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1194533 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [08:58:18] (03CR) 10Federico Ceratto: major-upgrade.py: MariaDB version upgrade cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [08:59:42] (03CR) 10Elukey: [C:03+2] Disable tiles invalidation on maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1194553 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:00:08] PROBLEM - Host cr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:09] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [09:00:21] is bast3007 not responding to anyone else? [09:00:29] !incidents [09:00:29] 6840 (UNACKED) Host cr1-esams [09:00:30] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [09:00:38] !accept 6840 [09:00:43] !ack 6840 [09:00:44] 6840 (ACKED) Host cr1-esams [09:00:53] oh, thank you [09:00:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:16] !incidents [09:01:16] 6840 (ACKED) Host cr1-esams [09:01:16] 6841 (UNACKED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [09:01:16] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [09:01:22] !ack 6841 [09:01:23] 6841 (ACKED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [09:01:27] <_joe_> should we depool esams? 
[09:01:33] <_joe_> I think so [09:01:41] yes, I think something is wrong in esams [09:01:43] <_joe_> topranks: ^^ [09:01:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:01:53] core router down, see above [09:01:54] can I help? [09:01:55] here [09:01:58] oh great [09:02:02] * topranks looking [09:02:06] I'm going to depool esams unless someone shouts [09:02:11] yes, please [09:02:12] ack [09:02:16] sounds like a good idea, thank you [09:02:18] depool sgtm [09:02:28] !log mvernon@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: no reason specified, ] [09:02:34] !log depool esams [09:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: no reason specified, ] [09:02:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:02:47] {{done}} [09:03:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:03:18] (03PS8) 10Elukey: osm: refactor swift scripts and make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 
(https://phabricator.wikimedia.org/T381565) [09:03:23] (03CR) 10CI reject: [V:04-1] major-upgrade.py: MariaDB version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:03:39] FIRING: [5x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:04:02] I don't see traffic or errors recovering [09:04:12] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [09:04:12] there were also some router alerts for esams earlier this night, but nothing pag.ing afaics [09:04:13] (03CR) 10Slyngshede: [C:03+2] site.pp move idp-test2005 to insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194543 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:04:18] (I know it has a ttl) [09:04:19] (03CR) 10Elukey: osm: refactor swift scripts and make event-template dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:04:27] topranks: do you need further assistance wrt esams? [09:04:39] Emperor: think I'm ok for now [09:04:45] on the affected device via OOB / serial now [09:04:47] I think now I see it [09:04:49] seems line card in slot 0 is down [09:04:54] topranks: ack, do shout if oncall can help [09:05:14] from grafana graphs probably things are going to start to stabilise if that is the only issue [09:05:25] as in.... bgp routing converging and traffic diverting via the other CR [09:05:30] does this need a statuspage update? 
[09:05:32] obviously we depooled in dns which is the right call [09:05:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:01] taavi: I think not, assuming that depooling esams is sufficient [09:06:08] taavi: we could perhaps say we are investigating issues in europe, I'm unsure of exact user impact [09:06:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [09:06:18] FIRING: [7x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [09:06:26] likely esams is returning to health on the other leg anyway [09:06:26] 5xx are also down again (not fully recovered) [09:06:28] !incidents [09:06:29] 6840 (ACKED) Host cr1-esams [09:06:29] 6842 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [09:06:29] 6841 (RESOLVED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [09:06:30] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [09:06:34] that NEL alert seems to indicate some user impact I guess :D [09:06:38] !ack 6842 [09:06:38] !ack 6842 [09:06:38] 6842 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [09:06:38] 6842 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [09:06:41] (03PS1) 10Krinkle: varnish: Remove unused "Mobile Redirect" logic [puppet] - 10https://gerrit.wikimedia.org/r/1194558 
(https://phabricator.wikimedia.org/T405931) [09:06:51] FIRING: [8x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:ae1 (External: AMS-IX 3x10G mynl-mem-9305) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:07:22] I can write a statuspage update [09:07:25] at that moment we issued a fair amount of 5XX, as expected: https://grafana.wikimedia.org/goto/EPMhDM6HR?orgId=1 [09:07:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:08:03] jelto: OK, I guess we can say we're monitoring at this point [09:08:20] !log disable BGP to asw*-esams from cr1-esams as the CR external links are also down [09:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:39] {{done}} [09:09:56] NELs should resolve soon [09:10:19] yeah, I'd expect things to sort themselves out as DNS caches update [09:10:35] cache invalidation being known to be simple & straightforward :) [09:11:09] NELs from the UK (taken as an example) are back to normal after a massive spike in tcp.timed_out [09:11:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [09:11:18] RESOLVED: [7x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from DE) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [09:11:25] 🥳 [09:11:33] !incidents [09:11:33] 6840 (ACKED) Host cr1-esams [09:11:33] 
6842 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [09:11:34] 6841 (RESOLVED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [09:11:34] 6839 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service eqiad) [09:11:34] that will be due to both users now hitting drmrs and the recovery in esams as traffic reconverged via cr2 [09:11:41] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on remaining Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194562 (https://phabricator.wikimedia.org/T403510) [09:11:43] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194563 (https://phabricator.wikimedia.org/T403510) [09:12:04] (03PS11) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510) [09:12:39] FIRING: [12x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:12:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:13:39] RESOLVED: [5x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:14:12] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [09:14:30] (03CR) 10Elukey: [C:03+2] osm: refactor swift scripts and 
make event-template dynamic [puppet] - 10https://gerrit.wikimedia.org/r/1193807 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:14:51] topranks: let me know if you need anything. I can also open a task and backfill the current status. [09:14:52] naive question: we are repooling when cr1 is back to life? or can cr2 handle all of that traffic alone? [09:16:17] I think that's up to t.opranks [09:16:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:16:51] FIRING: [8x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:ae1 (External: AMS-IX 3x10G mynl-mem-9305) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:19:01] (03CR) 10Ladsgroup: [C:03+1] Disable wmgUseMdotRouting on remaining Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194562 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [09:21:25] jelto: cr2 should be able to handle all the traffic, but I think it's probably best we stay depooled at least for now [09:21:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to esams RIPE Atlas anchor: failures over threshold for measurement 59935536 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:23:11] ack, sounds good to me, do you need a task? 
I'm happy to open one if needed [09:23:25] (03PS1) 10MVernon: swift: remove 3 drained codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194566 (https://phabricator.wikimedia.org/T354872) [09:23:44] I'm just about to do so, I think the "panic" part is over so I'll go ahead and do that, then raise the issue with JTAC [09:24:02] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [09:24:03] it's definitely a hardware failure, logs full of stuff about CRC errors communicating with the line card, timeouts talking to it etc [09:24:35] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:25:06] :( [09:25:29] drmrs seems fine - it's handling the combined throughput level esams+drmrs was doing prior to the issue [09:25:39] okay thanks for looking into this. That sounds bad [09:26:11] https://grafana.wikimedia.org/goto/Dce0dMeNg [09:26:15] so we leave esams depooled for now? Should I resolve the statuspage incident then? [09:26:36] jelto: I think so (to both your questions :) ) [09:26:48] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:26:52] yes I think we can resolve the statuspage incident [09:27:13] {{done}} [09:27:16] (03PS4) 10Federico Ceratto: migrate.py: MariaDB version migration cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [09:30:57] (03CR) 10Marostegui: [C:03+1] swift: remove 3 drained codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194566 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:31:23] (03CR) 10Federico Ceratto: [C:03+1] "Checked hostnames against tasks and description." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194566 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:32:57] (03CR) 10MVernon: [C:03+2] swift: remove 3 drained codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194566 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [09:35:23] FIRING: GnmiTargetDown: cr1-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:36:30] (03PS1) 10Elukey: Enable tiles invalidation for maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1194569 (https://phabricator.wikimedia.org/T381565) [09:36:51] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS trixie [09:37:04] (03CR) 10Elukey: [C:03+2] profile::puppetserver::backup: add a backup for /var/lib/puppet/ssl [puppet] - 10https://gerrit.wikimedia.org/r/1194192 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [09:37:35] 14SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 06DBA, 10MediaWiki-libs-Rdbms, 07Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#11253837 (10Reedy) [09:37:42] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning [09:37:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2027.codfw.wmnet onto es2052.codfw.wmnet [09:41:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1194569 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:42:45] (03CR) 10Elukey: [C:03+2] Enable tiles invalidation for maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1194569 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:47:50] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [09:49:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [09:51:13] (03PS1) 10Btullis: Disable monitoring for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1194570 (https://phabricator.wikimedia.org/T402943) [09:51:56] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705 (10cmooney) 03NEW p:05Triage→03High [09:52:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7229/console" [puppet] - 10https://gerrit.wikimedia.org/r/1194570 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [09:55:57] (03PS1) 10Majavah: P:toolforge::proxy: Remove now-unused static files [puppet] - 10https://gerrit.wikimedia.org/r/1194572 [09:56:07] (03PS1) 10Hashar: Fix link to task in the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 [09:56:42] (03PS2) 10Majavah: P:toolforge::proxy: Remove now-unused static files [puppet] - 10https://gerrit.wikimedia.org/r/1194572 [09:57:37] (03CR) 10Hashar: [C:03+2] Add a banner for a Gerrit switch over maintenance (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1000) [10:00:44] (03CR) 10Tacsipacsi: "Thanks for the patch!" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [10:02:45] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [10:05:58] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::proxy: Remove now-unused static files [puppet] - 10https://gerrit.wikimedia.org/r/1194572 (owner: 10Majavah) [10:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:10:12] (03CR) 10MVernon: [C:03+2] wmflib: discard new directory entries from swift_disks fact [puppet] - 10https://gerrit.wikimedia.org/r/1193797 (https://phabricator.wikimedia.org/T404351) (owner: 10MVernon) [10:13:32] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for the metamoniting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) [10:14:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [10:14:46] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705#11253997 (10cmooney) JTAC Case 2025-1008-891506 raised. 
[10:15:16] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS trixie [10:15:26] (03PS1) 10Btullis: Mimic the signing behaviour of the apt module for thirdparty/bigtop15 [puppet] - 10https://gerrit.wikimedia.org/r/1194579 (https://phabricator.wikimedia.org/T406148) [10:15:33] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie - https://phabricator.wikimedia.org/T404351#11253998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS tr... [10:16:17] (03CR) 10Btullis: [V:03+1 C:03+2] Disable monitoring for an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1194570 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [10:16:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11254002 (10elukey) Hi Brian! In the above use case IIUC your new host is being added to Netbox's data, and even if it is puppet-specific for... 
[10:16:25] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [10:17:44] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2005.wikimedia.org with OS trixie [10:19:28] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194158 (owner: 10Muehlenhoff) [10:20:11] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Remove now-unused static files [puppet] - 10https://gerrit.wikimedia.org/r/1194572 (owner: 10Majavah) [10:20:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [10:20:50] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [10:22:19] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [10:27:32] (03PS2) 10Btullis: Mimic the signing behaviour of the apt module for thirdparty/bigtop15 [puppet] - 10https://gerrit.wikimedia.org/r/1194579 (https://phabricator.wikimedia.org/T406148) [10:29:04] (03CR) 10Vgutierrez: [C:03+1] gateway-check: Group-based routing approach [puppet] - 10https://gerrit.wikimedia.org/r/1193903 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:29:39] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [10:30:54] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [10:31:13] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [10:33:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:33:55] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [10:34:05] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [10:34:10] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:37:33] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705#11254135 (10cmooney) Typical lackluster from Juniper. After finally looking at the logs they requested we re-seat the card, so I will work to create a r... [10:39:02] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [10:42:46] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:43:08] !log Disabling puppet on cp nodes - 1193903: gateway-check: Group-based routing approach | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1193903 - T406318 [10:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:11] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318 [10:43:37] (03CR) 10Muehlenhoff: "Two remaining comments inline, looks good otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [10:46:36] (03CR) 10Clément Goubert: [C:03+2] gateway-check: Group-based routing approach [puppet] - 10https://gerrit.wikimedia.org/r/1193903 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:53:17] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS bookworm [11:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1100). 
[11:00:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [11:09:47] !log imported megacli into thirdparty/hwraid (upstream repo doesn't cover trixie yet, copied over from bookworm) T391083 [11:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:51] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [11:18:00] (03PS1) 10Esanders: Revert "Invalidate Flow cache on enwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 [11:18:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders) [11:19:41] !log mvernon@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2078.codfw.wmnet with OS trixie [11:19:53] 06SRE, 10SRE-swift-storage: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie - https://phabricator.wikimedia.org/T404351#11254305 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS trixie executed with errors:... [11:21:28] 06SRE, 10SRE-swift-storage: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie - https://phabricator.wikimedia.org/T404351#11254320 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The new fact works; the failure is because the following key packages are not available in t... 
[11:22:04] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2078.codfw.wmnet with OS bullseye [11:22:14] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11254323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS bullseye [11:22:21] !log mvernon@cumin1002 START - Cookbook sre.hosts.move-vlan for host ms-be2078 [11:22:36] !log mvernon@cumin1002 START - Cookbook sre.dns.netbox [11:25:00] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:25:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11254329 (10Jclark-ctr) [11:26:21] !log Enabling puppet on cp nodes - 1193903: gateway-check: Group-based routing approach | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1193903 - T406318 [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:25] T406318: rest.php via rest-gateway production rollout - https://phabricator.wikimedia.org/T406318 [11:27:28] (03Abandoned) 10Clément Goubert: gateway-check: Introduce regex matching [puppet] - 10https://gerrit.wikimedia.org/r/1193882 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:27:33] (03CR) 10Santiago Faci: [C:03+1] "Just a comment about a couple of attributes that are added by default by the platform. 
It will work anyway though" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) (owner: 10Milimetric) [11:27:58] !log mvernon@cumin1002 START - Cookbook sre.dns.netbox [11:28:03] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-codfw [11:30:06] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker-codfw [11:31:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11254364 (10Jclark-ctr) 05Open→03Resolved [11:32:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13] - https://phabricator.wikimedia.org/T383227#11254367 (10Jclark-ctr) [11:32:57] (03Abandoned) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1193811 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [11:33:00] (03Abandoned) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1193810 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [11:33:02] (03Abandoned) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group1 1% [puppet] - 10https://gerrit.wikimedia.org/r/1193809 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [11:33:04] (03Abandoned) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1193808 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [11:33:07] (03Abandoned) 10Hnowlan: trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1193805 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) 
[11:33:40] mvernon@cumin1002 reimage (PID 2293117) is awaiting input [11:34:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11254376 (10Jclark-ctr) Replaced fan modular with spare from storage room. Pending tac ticket with juniper [11:34:02] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2005.wikimedia.org with OS bookworm [11:34:56] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2078 - mvernon@cumin1002" [11:35:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11254391 (10Jclark-ctr) Verified fan speeds compared with cr1 ` jclark@re0.cr1-eqiad> show chassis fan Item Status... [11:37:58] (03CR) 10Hnowlan: "Overall this makes sense to me, but just wanted to clarify: these alerts are for the rest-gateway. By the end of this week hopefully rest." 
[alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [11:38:02] mvernon@cumin1002 reimage (PID 2293117) is awaiting input [11:39:47] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp-test2005.wikimedia.org [11:39:49] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [11:40:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2078 - mvernon@cumin1002" [11:40:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:40:06] !log mvernon@cumin1002 START - Cookbook sre.dns.wipe-cache ms-be2078.codfw.wmnet 239.32.192.10.in-addr.arpa 9.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:40:09] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11254526 (10MoritzMuehlenhoff) The config shipped in Debian trixie is already following that scheme and very minimal, Debian only ships this: ` dnssec: # validation: process... 
[11:40:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2078.codfw.wmnet 239.32.192.10.in-addr.arpa 9.3.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:40:10] !log mvernon@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2078 [11:40:15] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194592 (https://phabricator.wikimedia.org/T406318) [11:40:19] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) [11:40:30] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194594 (https://phabricator.wikimedia.org/T406318) [11:40:34] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) [11:40:42] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) [11:40:46] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) [11:40:50] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) [11:41:00] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) [11:41:48] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] 
- 10https://gerrit.wikimedia.org/r/1194592 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:41:52] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:41:56] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194594 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:42:07] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:42:13] Sorry wikibugs <3 [11:42:15] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:42:41] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:42:45] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:42:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2078 [11:42:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2078 [11:42:49] (03CR) 10Hnowlan: [C:03+1] trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:43:17] (03CR) 10CI reject: [V:04-1] trafficserver: rest-gateway 
routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:43:22] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:43:30] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host idp-test2005.wikimedia.org [11:44:24] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) [11:44:28] (03CR) 10CI reject: [V:04-1] trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:44:32] (03CR) 10CI reject: [V:04-1] trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [11:45:13] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194594 (https://phabricator.wikimedia.org/T406318) [11:45:13] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) [11:45:13] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) [11:45:13] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) [11:45:14] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) 
[11:45:16] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) [11:47:00] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp-test2005.wikimedia.org [11:47:02] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [11:48:17] (03PS1) 10Gergő Tisza: jwt: Use core cookie settings [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194603 (https://phabricator.wikimedia.org/T406621) [11:48:42] (03PS1) 10Gergő Tisza: jwt: Use core cookie settings [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194604 (https://phabricator.wikimedia.org/T406621) [11:49:54] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] interface: new define for additional IPs [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [11:50:06] (03CR) 10Filippo Giunchedi: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [11:50:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983#11254654 (10Jclark-ctr) [11:50:30] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:50:36] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host idp-test2005.wikimedia.org [11:50:41] (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: have additional IPs survive reboots [puppet] - 10https://gerrit.wikimedia.org/r/1191327 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [11:50:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio100[1-3] - https://phabricator.wikimedia.org/T405983#11254664 (10Jclark-ctr) 05Open→03Resolved a:05VRiley-WMF→03Jclark-ctr 
Installed 3ssd into slot 4 on each server [11:51:03] (03PS1) 10Gergő Tisza: Temporarily undeploy JWT session cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194605 (https://phabricator.wikimedia.org/T399631) [11:51:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194605 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [11:51:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194603 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [11:51:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194604 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [11:52:08] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705#11254678 (10cmooney) Remote hands request CS3302125 has been raised to the Digital Realty staff on site in AMS9 Science Park. 
[11:54:04] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11254703 (10Jclark-ctr) @Jgreen i have installed the 5th drive T405983 was there anything left on dcops side @VRiley-WMF [11:57:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11254725 (10Jclark-ctr) @BTullis can you assist with this [11:57:15] !log slyngshede@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp-test2005.wikimedia.org [11:59:46] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [12:01:17] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [12:02:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1194579 (https://phabricator.wikimedia.org/T406148) (owner: 10Btullis) [12:04:18] (03PS1) 10D3r1ck01: Force OATHManage to be on central domain [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194607 (https://phabricator.wikimedia.org/T401773) [12:04:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2078.codfw.wmnet with reason: host reimage [12:05:02] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2005.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [12:05:26] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test2005.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [12:05:26] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:27] !log 
slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2005.wikimedia.org [12:05:48] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp-test2005.wikimedia.org [12:05:49] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [12:06:47] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker2002.codfw.wmnet} and (A:dse-k8s-master-codfw or A:dse-k8s-worker-codfw) [12:07:56] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194592 (https://phabricator.wikimedia.org/T406318) [12:07:57] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) [12:07:57] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194594 (https://phabricator.wikimedia.org/T406318) [12:07:57] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) [12:07:58] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) [12:08:01] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) [12:08:05] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) [12:08:09] (03PS3) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 
(https://phabricator.wikimedia.org/T406318) [12:08:13] (03PS1) 10Clément Goubert: trafficserver: clean up testing exceptions [puppet] - 10https://gerrit.wikimedia.org/r/1194608 (https://phabricator.wikimedia.org/T406318) [12:08:17] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194609 (https://phabricator.wikimedia.org/T406318) [12:08:21] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194610 (https://phabricator.wikimedia.org/T406318) [12:08:25] (03PS1) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194611 (https://phabricator.wikimedia.org/T406318) [12:08:53] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{dse-k8s-worker2002.codfw.wmnet} and (A:dse-k8s-master-codfw or A:dse-k8s-worker-codfw) [12:08:59] (03CR) 10Btullis: [C:03+2] Mimic the signing behaviour of the apt module for thirdparty/bigtop15 [puppet] - 10https://gerrit.wikimedia.org/r/1194579 (https://phabricator.wikimedia.org/T406148) (owner: 10Btullis) [12:09:24] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2005.wikimedia.org - slyngshede@cumin1003" [12:09:28] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2005.wikimedia.org - slyngshede@cumin1003" [12:09:28] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:28] !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp-test2005.wikimedia.org on all recursors [12:09:32] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
idp-test2005.wikimedia.org on all recursors [12:09:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194607 (https://phabricator.wikimedia.org/T401773) (owner: 10D3r1ck01) [12:10:02] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2005.wikimedia.org - slyngshede@cumin1003" [12:10:06] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2005.wikimedia.org - slyngshede@cumin1003" [12:10:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194150 (https://phabricator.wikimedia.org/T401773) (owner: 10Reedy) [12:10:24] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp-test2005.wikimedia.org with OS trixie [12:10:54] (03PS3) 10Hashar: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 [12:11:29] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406626#11254775 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Have rebalanced power [12:11:58] (03PS1) 10Elukey: services: reduce tegola's cronjob paralleism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194612 (https://phabricator.wikimedia.org/T381565) [12:12:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - 
https://phabricator.wikimedia.org/T400876#11254784 (10MatthewVernon) [12:13:20] (03PS1) 10Filippo Giunchedi: interface: ensure as string not bare word [puppet] - 10https://gerrit.wikimedia.org/r/1194613 (https://phabricator.wikimedia.org/T347681) [12:14:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11254798 (10MatthewVernon) @Jhancock.wm ms-be2083 and ms-be2084 are now ready to have their controllers swapped - can you do them, please? I've downtimed... [12:15:14] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ms-be[2083-2084].codfw.wmnet with reason: awaiting controller swap [12:15:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11254800 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bc05300e-e772-4e9b-8bf7-cfc19155167c) set by mvernon@cumin2002 for 3 days, 0:... [12:16:44] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "Trivial change on a new class, self merging" [puppet] - 10https://gerrit.wikimedia.org/r/1194613 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [12:18:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11254807 (10BTullis) >>! In T405943#11250829, @RobH wrote: > @btullis, > For all the items with 'should be fine to take down a single node at... 
[12:20:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194612 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:20:53] (03CR) 10Elukey: [C:03+2] services: reduce tegola's cronjob paralleism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194612 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:20:54] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11254809 (10elukey) Just added the send-invalidation tiles timer on maps2011, it should go through the imposm's cache and send invalidation events. After Tegola's pregen pods consume... [12:21:21] (03CR) 10Clément Goubert: [C:03+2] trafficserver: clean up testing exceptions [puppet] - 10https://gerrit.wikimedia.org/r/1194608 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [12:21:27] (03CR) 10Clément Goubert: [C:03+2] trafficserver: rest-gateway routes for rest.php: group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194592 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [12:22:23] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [12:22:48] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2078.codfw.wmnet with OS bullseye [12:22:54] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [12:22:55] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11254815 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS bullseye compl... 
[12:24:04] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [12:24:31] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [12:24:51] (03CR) 10Hashar: Fix link to task in the motd banner (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [12:25:07] (03PS2) 10Hashar: Fix link to task in the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 [12:25:22] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [12:25:57] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [12:27:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11254829 (10BTullis) [12:28:12] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test2005.wikimedia.org with reason: host reimage [12:28:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), No backups: 6 (puppetserver1001, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:29:11] (03PS4) 10Hashar: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 [12:30:43] (03CR) 10Hashar: "Rebased and includes fixes by @tacsipacsi@jnet.hu made in parent patch Id1f4f0cd8d527504dac813e76fea32508c6e3bb2" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [12:33:33] (03CR) 10Milimetric: [C:03+2] "self-merging this as I've got approval from a couple of other people and it's not affecting other code." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) (owner: 10Milimetric) [12:33:49] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test2005.wikimedia.org with reason: host reimage [12:34:20] jouncebot: nowandnext [12:34:20] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [12:34:20] In 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1300) [12:34:29] (03Merged) 10jenkins-bot: Configure a web_base_with_ip stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194175 (https://phabricator.wikimedia.org/T406359) (owner: 10Milimetric) [12:34:59] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11254897 (10Jgreen) >>! In T367820#11254703, @Jclark-ctr wrote: > @Jgreen i have installed the 5th drive T405983 was there anything left on dcops side @VRiley-WMF Hey @VRiley-W... [12:36:43] (03PS10) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [12:37:01] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) [12:37:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11254903 (10BTullis) >>! In T405943#11254807, @BTullis wrote: > For all the items with 'should be fine to take down a single node at a time' d... 
[12:39:39] (03CR) 10Slyngshede: P:cache::haproxy copy private repo data (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:39:41] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:40:54] (03CR) 10Clément Goubert: [C:03+2] trafficserver: rest-gateway routes for rest.php: group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194593 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [12:43:29] (03CR) 10Lucas Werkmeister (WMDE): Revert "Invalidate Flow cache on enwiktionary" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders) [12:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:45:36] !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=fywiki --logwiki=metawiki Constable31 Shogeneral # T406731 [12:45:40] T406731: Unblock stuck global rename of Shogeneral - https://phabricator.wikimedia.org/T406731 [12:46:33] (03PS1) 10Slyngshede: site.pp clean up idp-test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1194618 (https://phabricator.wikimedia.org/T406455) [12:48:29] (03CR) 10Muehlenhoff: [C:03+1] site.pp clean up idp-test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1194618 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:48:47] FYI, group0 rest.php is now routed at 100% through the rest-gateway (or at least the CR is merged and puppet is deploying it) [12:48:56] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "I was wrong in the diagnosis of this problem: the unit was not 
enabled because it failed to start out of the box, at the next puppet run i" [puppet] - 10https://gerrit.wikimedia.org/r/1194613 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [12:49:08] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test2005.wikimedia.org with OS trixie [12:49:08] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test2005.wikimedia.org [12:50:50] (03CR) 10Slyngshede: [C:03+2] site.pp clean up idp-test configuration [puppet] - 10https://gerrit.wikimedia.org/r/1194618 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:51:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11254965 (10BTullis) [12:53:25] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-09-30-194529 to 2025-10-06-215412 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194619 (https://phabricator.wikimedia.org/T380964) [12:53:31] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-25-181720 to 2025-10-06-225918 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194620 (https://phabricator.wikimedia.org/T380964) [12:57:00] (03PS2) 10Gergő Tisza: Temporarily undeploy JWT session cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194605 (https://phabricator.wikimedia.org/T399631) [12:57:00] (03PS1) 10Gergő Tisza: Deploy JWT session cookies to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) [12:57:28] (03PS2) 10Gergő Tisza: Deploy JWT session cookies to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) [12:58:02] (03PS1) 10Tiziano Fogli: check_gdnsd_checkconf: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1184469 
(https://phabricator.wikimedia.org/T384425) [12:58:03] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [12:58:09] (03PS7) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [12:59:37] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [12:59:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11255005 (10BTullis) >>! In T405943#11251466, @Jclark-ctr wrote: > @BTullis For an-test-master1002 we need to failover to it self when we mov... [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1300). [13:00:05] pcoombe, edsanders, tgr, and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:43] I'm here :) [13:01:00] o/ [13:01:12] o/ [13:01:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:01:30] o/ [13:02:36] (03PS1) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 [13:02:50] (03PS1) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194624 [13:02:59] edsanders: are you there? (I also left a minor comment on your config change) [13:03:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2002.codfw.wmnet [13:03:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:03:21] I’m thinking we can probably deploy the config changes by pcoombe and edsanders together [13:03:25] (03PS7) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [13:03:39] or start with just pcoombe if ed isn’t here yet [13:03:41] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681#11255018 (10fgiunchedi) 05Open→03Resolved This is done; NFS s... [13:04:29] tgr_: should your config change and backports be deployed together or separately? 
[13:04:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:05:08] let’s start with pcoombe’s donatewiki patch [13:05:12] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org: I can review this but I think it will be good if the final word came from you. When you have a chance, please l" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [13:05:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194278 (https://phabricator.wikimedia.org/T406638) (owner: 10Pcoombe) [13:05:36] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:07] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms [13:06:10] (03Merged) 10jenkins-bot: Disable mobilefrontend on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194278 (https://phabricator.wikimedia.org/T406638) (owner: 10Pcoombe) [13:06:39] <_joe_> oh hello cr1-esams, how are you [13:06:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:07:04] PROBLEM - Host 2a02:ec80:300:1:185:15:59:2 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:04] PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:05] 
“There were unexpected commits pulled from origin for /srv/mediawiki-staging” yay [13:07:08] (https://spiderpig.wikimedia.org/jobs/718) [13:07:37] cc milimetric per commit author ^ [13:07:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:08:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:08:12] the change is apparently https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194175 [13:08:17] merged half an hour ago [13:08:26] Lucas_WMDE: sorry - what happened? [13:08:26] Lucas_WMDE: either together, or the config change first [13:08:27] but not deployed? [13:08:39] (03PS1) 10Federico Ceratto: es2052.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1194625 (https://phabricator.wikimedia.org/T402859) [13:08:45] ah - I was just about to try and find out how to deploy this thing [13:08:45] milimetric: I’m trying to figure it out myself, but AFAICT scap is complaining that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194175 hadn’t been deployed yet [13:08:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:53] (I thought it was automatic, my bad) [13:08:58] well it’s going to roll out with my deployment now [13:09:03] can you hang around to test it when it’s on mwdebug? 
[13:09:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:09:07] of course, thank you [13:09:10] ok thanks [13:09:16] continuing with deployment then [13:09:16] RECOVERY - Host 2a02:ec80:300:2:185:15:59:34 is UP: PING OK - Packet loss = 0%, RTA = 80.26 ms [13:09:20] RECOVERY - Host 2a02:ec80:300:1:185:15:59:2 is UP: PING OK - Packet loss = 0%, RTA = 80.28 ms [13:09:21] tgr_: ack [13:09:25] any preference? ^^ [13:09:26] RECOVERY - Host cr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 80.54 ms [13:09:36] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:09:37] no, either is fine [13:09:40] ok [13:10:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2002.codfw.wmnet [13:10:10] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194278|Disable mobilefrontend on donatewiki (T406638)]] [13:10:14] T406638: Disable mobilefrontend on donatewiki - https://phabricator.wikimedia.org/T406638 [13:10:17] * Lucas_WMDE tries to understand the changes [13:10:23] RESOLVED: GnmiTargetDown: cr1-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [13:10:30] ok I think I see the point [13:10:33] might as well do them together then [13:10:36] (03CR) 10CI reject: [V:04-1] SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 
10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:11:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:36] (03PS1) 10Federico Ceratto: instances.yaml: add es2052 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1194626 (https://phabricator.wikimedia.org/T402859) [13:11:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:11:56] (03CR) 10D3r1ck01: "`" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:11:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [13:12:00] (03CR) 10D3r1ck01: "recheck" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:12:25] (03CR) 10Ssingh: "Thanks for the patch. Is there some documentation on how the wrapper works? I am interested in it purely for my own understanding. 
Happy t" [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:12:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:13:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194603 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [13:13:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194604 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [13:13:22] Lucas_WMDE, would it make sense to +2 mine ahead deployment to save deployer time? :) They can wait if doing that will cause trouble/be counterproductive. [13:13:36] xSavitar: I just did that for tgr_’s backports [13:13:44] I would do yours separately, so wait a bit before +2ing them [13:13:49] unless they should happen together [13:13:58] Sure! 
All yours :) [13:14:08] No, they can happen after Tgr's [13:14:13] ok thanks :) [13:15:10] (also edsanders’ config change might happen in between) [13:15:49] (03Merged) 10jenkins-bot: jwt: Use core cookie settings [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194603 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [13:15:51] (03Merged) 10jenkins-bot: jwt: Use core cookie settings [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194604 (https://phabricator.wikimedia.org/T406621) (owner: 10Gergő Tisza) [13:16:10] wow, those gate-and-submits finished quick [13:16:15] (03CR) 10Ottomata: [C:03+2] sqoop - fix centralauth - use seperate script and add to sqoop-whole-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1193926 (https://phabricator.wikimedia.org/T389666) (owner: 10Ottomata) [13:16:18] scap hasn’t even finished building the images for the current deploy yet [13:16:20] they are for different things so the order doesn't matter [13:16:27] but AFAIK it should be fine, nothing git pulls automatically [13:16:32] the CentralAuth patches I mean [13:16:59] so you can merge them ahead if you don't mind scap freaking out about unknown patches [13:17:29] (03CR) 10Vgutierrez: check_gdnsd_checkconf: enable nrpe wrapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:17:36] I’d still feel more comfortable deploying them separately :) [13:17:42] hopefully we’ll have enough time without running into the next window [13:18:20] oh, I should also ask if you want to self-service ^^ [13:18:22] either of you, even [13:18:29] (03PS8) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [13:18:40] (and then if you really want to you can deploy them together and I won’t stop you :P) [13:18:49] 
Lucas_WMDE: should I be able to see my change on debug now? still looks like mobilefrontend is enabled to me. sorry, first time doing one of these [13:18:56] pcoombe: nope, not yet [13:19:06] ok thanks [13:19:12] not sure why the image build is taking so long tbh [13:19:25] Lucas_WMDE, ah don't worry, you can do mine. I'll standby and test things for sure. [13:19:46] ok [13:19:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2001.codfw.wmnet [13:20:08] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:21:48] (03CR) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:24:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:24:19] (03CR) 10Tiziano Fogli: "Thanks for the routing-related info." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:25:27] (03PS2) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 [13:26:30] (03PS3) 10Gergő Tisza: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:26:31] * Lucas_WMDE is confused about the docker images being built [13:26:50] we seemingly have 81, 83, 81-next *and* 83-next images [13:27:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2001.codfw.wmnet [13:27:18] * Lucas_WMDE rescinds earlier hopefulness about there being enough time for a lot of deployments [13:27:37] “Waiting 300 seconds for swift after full mediawiki image build (T390251)” [13:27:44] why’s it doing a full mediawiki image build 😩 [13:27:47] that explains why it was so slow at least [13:27:56] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:28:37] (03CR) 10Gergő Tisza: SharedDomainHookHandler: Remove WebAuthn sitenotice (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:28:40] (03PS2) 10Tiziano Fogli: check_gdnsd_checkconf: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) [13:28:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [13:29:02] (03CR) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:29:12] “541 languages rebuilt out of 541” in rebuildLocalisationCache.php [13:29:14] no idea why 
though [13:29:36] (03CR) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:29:49] (03PS9) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [13:30:39] (03CR) 10Ssingh: "Thanks, that's very helpful." [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:30:51] (03CR) 10D3r1ck01: SharedDomainHookHandler: Remove WebAuthn sitenotice (031 comment) [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [13:31:04] (03CR) 10Muehlenhoff: [C:03+1] "Ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [13:31:15] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:31:41] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705#11255104 (10cmooney) Looks like the card re-seat did the trick: ` cmooney@re0.cr1-esams> show chassis fpc 0 detail Slot 0 information: State... 
[13:32:07] (03CR) 10Mooeypoo: "Yeah, it's a good point -- we're aware, but it's still VERY useful alerts to have us be notified of on Slack (and other teams, potentially" [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [13:32:22] ok it’s finally syncing [13:32:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:32:53] * milimetric is following along with popcorn [13:34:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [13:37:56] mwdebug k8s is at 75% [13:38:46] (03CR) 10Tiziano Fogli: "Icinga uses two different intervals: the check interval and the retry interval. The first is used when everything is working fine, while t" [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [13:39:12] !log lucaswerkmeister-wmde@deploy2002 pcoombe, lucaswerkmeister-wmde: Backport for [[gerrit:1194278|Disable mobilefrontend on donatewiki (T406638)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:39:15] T406638: Disable mobilefrontend on donatewiki - https://phabricator.wikimedia.org/T406638 [13:39:48] pcoombe, milimetric: please test on WikimediaDebug :) [13:39:56] roger [13:40:03] thanks, checking now [13:40:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194607 (https://phabricator.wikimedia.org/T401773) (owner: 10D3r1ck01) [13:40:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194150 (https://phabricator.wikimedia.org/T401773) (owner: 10Reedy) [13:40:29] ^ I’ll let the CI timing decide if tgr_ and xSavitar’s backports get deployed together or separately [13:40:47] reminder to self and others, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194605 *must* be merged as well before the next deployment [13:40:50] 10ops-esams, 06SRE, 06DC-Ops: esams: access switches fans blowing the wrong way - https://phabricator.wikimedia.org/T406734 (10cmooney) 03NEW p:05Triage→03Medium [13:40:53] (given that the backports which require this config change were already merged) [13:41:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:52] Lucas_WMDE: looks good to me [13:41:53] Lucas_WMDE: looks good to me [13:42:00] !log lucaswerkmeister-wmde@deploy2002 pcoombe, lucaswerkmeister-wmde: Continuing with sync [13:42:13] suspiciously similar messages ;) [13:42:14] thanks! [13:42:17] we're bots [13:42:47] no thank you! Sorry for the delay in deploy [13:43:59] thank you! 
[13:45:23] I guess the normal deploy will also be slower because everything has to pull a larger docker image [13:48:34] (03PS1) 10Tiziano Fogli: nrpewrapper: correlate Prometheus "for:" duration with timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1194632 (https://phabricator.wikimedia.org/T395446) [13:50:22] (03Merged) 10jenkins-bot: Force OATHManage to be on central domain [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194607 (https://phabricator.wikimedia.org/T401773) (owner: 10D3r1ck01) [13:50:23] (03Merged) 10jenkins-bot: Force OATHManage to be on central domain [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194150 (https://phabricator.wikimedia.org/T401773) (owner: 10Reedy) [13:50:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye [13:51:54] (03PS10) 10Btullis: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:51:59] (03PS2) 10DLynch: Launch VisualEditor EditCheck paste check a/b test to 22 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422) [13:52:41] (03CR) 10Tiziano Fogli: [C:03+2] nrpewrapper: correlate Prometheus "for:" duration with timer interval [puppet] - 10https://gerrit.wikimedia.org/r/1194632 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:52:58] (03PS1) 10Andrew Bogott: wmcs backups: exclude cinder volumes in query-service project [puppet] - 10https://gerrit.wikimedia.org/r/1194635 (https://phabricator.wikimedia.org/T406240) [13:53:02] (03CR) 10Tiziano Fogli: [C:03+2] "I'm self-merging since I forgot to update systems::job::timer in the previously submitted patch (https://gerrit.wikimedia.org/r/c/operatio" [puppet] - 10https://gerrit.wikimedia.org/r/1194632 
(https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:53:18] ok, CI has decreed that it’s all getting deployed together :) [13:54:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment (the I15bad21976 backports, which I also +2ed and which ended up being merged already, must n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194605 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:54:33] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194278|Disable mobilefrontend on donatewiki (T406638)]] (duration: 44m 23s) [13:54:37] T406638: Disable mobilefrontend on donatewiki - https://phabricator.wikimedia.org/T406638 [13:55:05] (03Merged) 10jenkins-bot: Temporarily undeploy JWT session cookies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194605 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:55:14] (03PS11) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [13:56:11] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194605|Temporarily undeploy JWT session cookies (T399631)]], [[gerrit:1194603|jwt: Use core cookie settings (T406621)]], [[gerrit:1194604|jwt: Use core cookie settings (T406621)]], [[gerrit:1194607|Force OATHManage to be on central domain (T401773)]], [[gerrit:1194150|Force OATHManage to be on central domain (T401773)]] [13:56:19] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:56:19] T406621: Session cookie JWTs of SUL and non-SUL wikis conflict - https://phabricator.wikimedia.org/T406621 [13:56:20] T401773: Always redirect 2FA management special page to auth domain on SUL wikis, so that WebAuthn setup can be offered - https://phabricator.wikimedia.org/T401773 [13:56:53] “9 languages rebuilt out of 541” yay [13:56:57] *0 lol 
[13:57:10] also, interesting, it’s apparently 540 in wmf.21 and 541 in wmf.22. guess mediawiki gained another language :) [13:57:25] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11255228 (10Jelto) [13:57:58] (03PS1) 10Elukey: profile::amd_gpu: add initial support for the k8s node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) [13:58:34] (probably T406198) [13:58:35] T406198: Add Bono (abr) to Names.php - https://phabricator.wikimedia.org/T406198 [13:59:08] Lucas_WMDE: I'll check with swfrench-wmf later today, we shouldn't be building 81-next [13:59:25] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7232/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:59:32] ack, thanks [13:59:56] (I was very close to writing something like “I’m sure claime will explain to me in a moment why this is good and correct” :P) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1400) [14:00:06] Lucas_WMDE: Heh [14:00:07] (but felt like it might’ve been a bit creepy ^^) [14:00:16] still deploying, sorry wikifunctioneers :( [14:00:22] Stop trying to predict me, I'm WILD [14:00:24] first deployment in the window unexpectedly required a full rebuild [14:01:12] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde, reedy, tgr: Backport for [[gerrit:1194605|Temporarily undeploy JWT session cookies (T399631)]], [[gerrit:1194603|jwt: Use core cookie settings (T406621)]], [[gerrit:1194604|jwt: Use core cookie settings (T406621)]], [[gerrit:1194607|Force OATHManage to be on central domain (T401773)]], [[gerrit:1194150|Force OATHManage 
to be on central domain (T401773) [14:01:12] ]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:01:13] (03CR) 10Dr0ptp4kt: "Apologies for delay, firefighting ate my homework. Response included on question." [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [14:01:21] tgr_, xSavitar: please test :) [14:01:27] * xSavitar testing... [14:02:02] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-09-30-194529 to 2025-10-06-215412 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194619 (https://phabricator.wikimedia.org/T380964) (owner: 10Jforrester) [14:02:55] Lucas_WMDE, my tests hows things seem to work correctly. [14:02:58] *shows [14:03:01] ok [14:03:23] is tgr_ still here? or does anyone else know how to test that JWT session cookies are undeployed? [14:03:32] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11255293 (10elukey) >>! In T394357#11225371, @elukey wrote: > @Jhancock.wm me and Jesse are running out of ideas, if you have time could you please open the host and check if the... 
[14:03:33] (I’m guessing the backports themselves aren’t testable given that cookies should be undeployed) [14:03:40] Lucas_WMDE: I'm confident disabling works [14:03:42] (03CR) 10CDanis: [C:03+1] mesh.configuration: Envoy config updates for 1.29 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [14:03:47] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde, reedy, tgr: Continuing with sync [14:03:47] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-09-30-194529 to 2025-10-06-215412 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194619 (https://phabricator.wikimedia.org/T380964) (owner: 10Jforrester) [14:03:48] fair enough [14:03:52] reenabling will be the interesting part [14:03:55] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11255295 (10Jelto) Deployment-wise everything is done, gitlab artifacts and packages use the object storage backend now. I also up... [14:03:57] heh [14:04:04] yeah, good luck during the late window :) [14:04:13] thanks for deploying! [14:04:15] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:04:32] (03PS5) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) [14:04:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11255300 (10VRiley-WMF) Okay, was looking at this issue a bit. There are currently two fiber cables involved with this process. After g... 
[14:04:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:05:21] tgr_, seems we're out of time to deploy the WebAuthn site notice backports? Cc Lucas_WMDE [14:05:34] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:42] aren't those part of the bundle? [14:06:06] tgr_, it wasn't +2'd yet, so not part of the bundle. [14:06:19] The sitenotice is still there but if it's quick enough, we can deploy it [14:06:22] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:39] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:25] I guess it won't interfere with the wikifunctions service deployment [14:07:38] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:07:42] so why not [14:08:01] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-09-25-181720 to 2025-10-06-225918 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194620 (https://phabricator.wikimedia.org/T380964) (owner: 10Jforrester) [14:08:08] it's not a big deal if it's left for the evening, though [14:08:11] I don't know if it will. Maybe James_F can give us a green light otherwise we can do it in the late window today [14:08:17] Sure, keep going. [14:08:29] It won't affect us, and we have no MW-land changes for today. [14:08:32] Oh thanks James_F. 
Will deploy that one now [14:08:52] !log re-pool esams in dns after cr1-esams restored to normal operation T406705 [14:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:56] T406705: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705 [14:09:11] Once Lucas_WMDE is done syncing [14:09:18] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for EMcFarland - https://phabricator.wikimedia.org/T406739 (10EMcFarland-WMF) 03NEW [14:09:19] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: cr1-esams is back online and working after card re-seat, T406705] [14:09:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: cr1-esams is back online and working after card re-seat, T406705] [14:09:33] ok [14:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:09:57] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-25-181720 to 2025-10-06-225918 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194620 (https://phabricator.wikimedia.org/T380964) (owner: 10Jforrester) [14:10:11] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194605|Temporarily undeploy JWT session cookies (T399631)]], [[gerrit:1194603|jwt: Use core cookie settings (T406621)]], [[gerrit:1194604|jwt: Use core cookie settings (T406621)]], [[gerrit:1194607|Force OATHManage to be on central domain (T401773)]], [[gerrit:1194150|Force OATHManage to be on central domain (T401773)]] (duration: 14m 0 [14:10:12] 0s) [14:10:17] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [14:10:18] T406621: Session cookie JWTs of SUL and 
non-SUL wikis conflict - https://phabricator.wikimedia.org/T406621 [14:10:18] T401773: Always redirect 2FA management special page to auth domain on SUL wikis, so that WebAuthn setup can be offered - https://phabricator.wikimedia.org/T401773 [14:10:21] xSavitar: over to you [14:10:34] Ack! Thanks Lucas_WMDE [14:10:43] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:11:05] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:11:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [14:11:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194624 (owner: 10D3r1ck01) [14:11:21] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:53] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:12:06] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:12:40] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:50] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859) [14:13:52] (03PS1) 10Federico Ceratto: es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) [14:15:20] xSavitar: it might be a good idea to remove the en.json and qqq.json changes from those backports, tbh [14:15:25] otherwise the deploy will take ages again [14:15:29] (I just looked at the changes now, sorry) [14:15:40] you can still remove them on master where it actually matters [14:16:09] Lucas_WMDE, 
so I interrupt the running process for now? [14:16:37] that would be my suggestion [14:16:42] Ack! [14:16:42] if it sounds sensible to you [14:16:48] I’m in a meeting and only half paying attention [14:16:51] so don’t just take my word for it ^^ [14:17:07] (03CR) 10Ssingh: [C:03+2] haptcha: add new role for hCaptcha proxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:17:42] Lucas_WMDE, I'll just re-schedule this for late window so I can fix anything that needs fixing. [14:17:57] Just wondering if en.json changes would trigger rebuilding all languages? [14:19:17] I'll move on, actually, there is enough time :) [14:19:22] (03CR) 10Arnaudb: [C:03+1] Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [14:19:23] not sure if it triggers rebuilding *all* languages [14:19:30] but any i18n changes mess up deployments afaik [14:19:35] doesn’t matter how many there are [14:19:35] (03Merged) 10jenkins-bot: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194623 (owner: 10D3r1ck01) [14:19:47] ok, gate-and-submit finished, you’re stuck with it now ;) [14:19:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194624 (owner: 10D3r1ck01) [14:19:53] (03Merged) 10jenkins-bot: SharedDomainHookHandler: Remove WebAuthn sitenotice [extensions/CentralAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194624 (owner: 10D3r1ck01) [14:19:55] (03CR) 10Arnaudb: [C:03+1] Fix link to task in the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [14:20:27] !log derick@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1194623|SharedDomainHookHandler: Remove WebAuthn sitenotice]], [[gerrit:1194624|SharedDomainHookHandler: Remove WebAuthn sitenotice]] [14:20:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11255372 (10cmooney) 05Open→03Resolved Thanks @Jclark-ctr. As you say it seems the one that has gone in is the same model as came out.... [14:21:26] (03PS1) 10Ssingh: site.pp: switch hcaptcha1001 to role hcaptcha_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1194646 (https://phabricator.wikimedia.org/T405631) [14:21:33] (this will probably run into the xLab window then) [14:21:40] (03CR) 10Elukey: profile::thanos: fix xlab SLI's recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [14:23:48] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams: MPC7E 3D 40XGE line card in slot 0 failure [Oct 2025] - https://phabricator.wikimedia.org/T406705#11255395 (10cmooney) 05Open→03Resolved esams has been re-pooled and traffic levels have returned to normal for the site. closing this task now,... [14:24:12] (03CR) 10Ssingh: [C:03+1] "OK that makes sense, thanks for sharing the snippet above." [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [14:24:36] (03CR) 10Ladsgroup: [C:03+1] es2052.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1194625 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [14:24:45] (03CR) 10Ladsgroup: [C:03+1] instances.yaml: add es2052 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1194626 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [14:27:12] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#11255415 (10jcrespo) All good to me to close an follow up later, but please let's merge T403946 asap (not asking you, I've been the... [14:28:13] claime: Lucas_WMDE: alas, "next" in the context of image builds means something entirely different from the "next" services we use for PHP upgrades [14:28:13] the former are single-version images running wmf/next, while normal multi-version images (the 83 flavour) are used in the "next" deployments at this time. [14:28:14] unfortunate nomenclature collision that developed concurrently =/ [14:28:26] oh [14:28:28] right [14:28:39] I see, thanks [14:28:55] Lucas_WMDE: so you were right, it was good and correct :D [14:29:00] Just wasn't me that said it [14:29:00] :D :D [14:29:12] it would help if the naming was better :) [14:29:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [14:29:34] maybe we should bite the rename bullet and call the single-version images well, singleversion [14:29:51] proposal: drop the -next from the pretrain images, and instead label the current suffixless images as -previous [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1430) [14:32:04] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for EMcFarland - https://phabricator.wikimedia.org/T406739#11255447 (10DMburugu) I approve [14:33:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bullseye [14:34:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [14:34:08] btullis@cumin1003 reimage (PID 1838146) is awaiting input [14:34:58] jouncebot: nowandnext [14:34:58] For the next 0 hour(s) and 25 minute(s): 
Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1400) [14:34:58] For the next 0 hour(s) and 25 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1430) [14:34:58] In 2 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1700) [14:38:00] (03PS12) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [14:43:02] (03CR) 10Ahmon Dancy: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1194156 (owner: 10Muehlenhoff) [14:46:23] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194594 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [14:47:50] !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1194623|SharedDomainHookHandler: Remove WebAuthn sitenotice]], [[gerrit:1194624|SharedDomainHookHandler: Remove WebAuthn sitenotice]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:50:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [14:50:27] Tested and it looks good. 
[14:50:32] !log derick@deploy2002 d3r1ck01, derick: Continuing with sync [14:52:49] PROBLEM - SSH on stat1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:53:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [14:54:10] (03CR) 10Andrew Bogott: [C:03+2] wmcs backups: exclude cinder volumes in query-service project [puppet] - 10https://gerrit.wikimedia.org/r/1194635 (https://phabricator.wikimedia.org/T406240) (owner: 10Andrew Bogott) [14:54:38] RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:59:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-launcher1003.eqiad.wmnet with OS bullseye [15:01:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:02:08] (03PS1) 10Muehlenhoff: Record LDAP access for emc-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1194653 [15:02:43] (03CR) 10Federico Ceratto: [C:03+2] es2052.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1194625 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:02:46] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: add es2052 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1194626 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:02:56] (03CR) 10Tacsipacsi: Fix link to task in the motd banner (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194573 (owner: 10Hashar) [15:03:03] !log derick@deploy2002 Finished scap sync-world: Backport for 
[[gerrit:1194623|SharedDomainHookHandler: Remove WebAuthn sitenotice]], [[gerrit:1194624|SharedDomainHookHandler: Remove WebAuthn sitenotice]] (duration: 42m 36s) [15:03:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:04:03] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for emc-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1194653 (owner: 10Muehlenhoff) [15:05:07] !log UTC afternoon backport+config window done [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:18] (03CR) 10Tacsipacsi: Ease configuration of the motd banner (032 comments) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [15:06:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:07:45] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for EMcFarland - https://phabricator.wikimedia.org/T406739#11255573 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This was requested/completed via Wikimedia IDM, resolving. 
[15:08:05] (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1194654 [15:08:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:44] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11255579 (10elukey) Trying to summarize the problem: * We know that the debian installer doesn't copy the EFI partition on all the d... 
[15:10:20] (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.11.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1194654 [15:11:55] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: testing [15:12:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye [15:12:40] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-launcher1003.eqiad.wmnet with reason: host reimage [15:14:04] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be1088.eqiad.wmnet with reason: testing [15:16:57] !log reboot ms-be1088 as a test for T404356 [15:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:00] T404356: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356 [15:18:22] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11255648 (10ssingh) Thanks @MoritzMuehlenhoff, that sounds like a good plan to me but leaving to @CDobbins for the final word. 
[15:18:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-launcher1003.eqiad.wmnet with reason: host reimage [15:19:18] 06SRE, 06serviceops: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965#11255650 (10kamila) 05Open→03Resolved a:03kamila Looks done to me, correct me if I'm wrong :-) [15:19:34] (03PS1) 10Hnowlan: rest-gateway: add wikibase rest.php match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194655 (https://phabricator.wikimedia.org/T406318) [15:21:59] 10ops-codfw, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609#11255669 (10Jhancock.wm) @bking i reseated the internal components and the server did boot. It did go into a BIOS update and is currently booted to the OS. I believ... [15:22:13] (03PS2) 10Anzx: eswiki, commonswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194650 (https://phabricator.wikimedia.org/T406655) [15:26:48] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: add wikibase rest.php match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194655 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [15:27:29] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11255689 (10elukey) I checked ms-be1088's boot properties and the disk boot option is `debian(SATA,Port:0)`, that IIUC is being set b... 
[15:29:30] (03CR) 10Hnowlan: [C:03+2] rest-gateway: add wikibase rest.php match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194655 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [15:30:37] (03CR) 10Cathal Mooney: [C:03+2] CHANGELOG: add changelogs for release v0.11.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1194654 (owner: 10Cathal Mooney) [15:31:16] (03Merged) 10jenkins-bot: rest-gateway: add wikibase rest.php match [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194655 (https://phabricator.wikimedia.org/T406318) (owner: 10Hnowlan) [15:33:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:33:47] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:34:14] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:34:21] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:35:58] (03PS4) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) [15:37:10] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:37:23] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:50] (03PS5) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) [15:41:13] (03CR) 10Elukey: sre.hardware.upgrade-firmware: fix ssd upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 
(https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:41:18] (03PS2) 10Kosta Harlan: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) [15:42:55] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) [15:45:44] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for EMcFarland - https://phabricator.wikimedia.org/T406739#11255749 (10Urbanecm_WMF) >>! In T406739#11255573, @MoritzMuehlenhoff wrote: > This was requested/completed via Wikimedia IDM, resolving. FTR, this task was created because the onboarding gui... [15:46:14] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11255751 (10elukey) Matthew told me that ms-be2078 can be used for testing the reimage with UEFI, it is a Dell node with Legacy setti... 
[15:46:15] (03PS1) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) [15:46:19] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.11.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1194654 (owner: 10Cathal Mooney) [15:47:30] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194595 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [15:51:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-launcher1003.eqiad.wmnet with OS bullseye [15:52:47] 10ops-codfw, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609#11255835 (10bking) a:05Jhancock.wm→03bking @Jhancock.wm Thanks, I'll grab this ticket and try a reimage. If that works, I'll go ahead and close this one out. 
[15:53:01] 10ops-codfw, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609#11255838 (10bking) 05Open→03In progress p:05Triage→03Medium [15:53:35] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [15:53:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye [15:54:42] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [15:56:56] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [15:57:47] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for EMcFarland - https://phabricator.wikimedia.org/T406739#11255866 (10MoritzMuehlenhoff) @EMcFarland-WMF Thanks for fixing! We had updated various parts of onboarding docs, but apparently that one was missed. The landing page for an overview of what... 
[15:58:23] (03PS2) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) [15:58:24] (03CR) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [15:58:58] (03PS1) 10Btullis: Disable istio-injection for the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194673 (https://phabricator.wikimedia.org/T405490) [15:59:22] (03PS1) 10Fabfur: haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) [15:59:23] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [15:59:35] (03CR) 10CI reject: [V:04-1] ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:00:17] (03PS3) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) [16:00:28] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:00:36] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:01:56] (03CR) 10Dreamy Jazz: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:02:12] (03CR) 10CI reject: [V:04-1] haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur) [16:03:29] (03PS4) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) [16:03:32] (03CR) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:03:39] (03CR) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:04:09] (03PS1) 10Cathal Mooney: K8s reverse DNS delegation: remove wikikube-ctrl1001 and add new nets [dns] - 10https://gerrit.wikimedia.org/r/1194678 (https://phabricator.wikimedia.org/T383227) [16:04:16] (03PS5) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) [16:05:34] (03CR) 10Dreamy Jazz: [C:03+1] ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:06:03] (03CR) 10Dreamy Jazz: [C:03+1] ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:06:37] (03PS6) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 
(https://phabricator.wikimedia.org/T404204) [16:06:39] (03CR) 10Kosta Harlan: ConfirmEdit/hCaptcha: Implement automatic failover (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:07:11] (03CR) 10BCornwall: [C:03+1] site.pp: switch hcaptcha1001 to role hcaptcha_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1194646 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [16:07:20] (03CR) 10Dreamy Jazz: [C:03+1] ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [16:10:17] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2017.codfw.wmnet with reason: host reimage [16:13:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2017.codfw.wmnet with reason: host reimage [16:19:40] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1194646 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [16:21:52] (03PS2) 10Fabfur: haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) [16:26:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add es2052 T402859', diff saved to https://phabricator.wikimedia.org/P83675 and previous config saved to /var/cache/conftool/dbconfig/20251008-162623-fceratto.json [16:26:28] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [16:28:17] (03PS1) 10Andrew Bogott: Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) [16:28:36] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 163889112 and 17 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:28:40] (03PS6) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [16:28:48] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), No backups: 6 (puppetserver1001, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:29:27] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [16:30:36] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 18200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:30:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2017.codfw.wmnet with OS bullseye [16:35:23] (03PS2) 10Andrew Bogott: Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) [16:36:38] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es2052.codfw.wmnet [16:36:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2052.codfw.wmnet [16:37:23] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2052 gradually with 4 steps - Pooling in new host [16:42:27] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1002.eqiad.wmnet with reason: WIP [16:43:09] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [16:43:58] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: WIP [16:44:54] FIRING: [2x] 
PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:46:54] (03CR) 10JHathaway: [C:03+2] civicrm: set postfix relay host to wikimedia's mx-out [puppet] - 10https://gerrit.wikimedia.org/r/1194298 (https://phabricator.wikimedia.org/T406278) (owner: 10JHathaway) [16:53:15] (03PS1) 10Bking: wdqs-internal-scholarly: add wdqs2017 [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) [16:53:38] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2002.codfw.wmnet with reason: WIP [16:53:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [16:55:28] (03PS1) 10Jcrespo: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) [16:55:52] (03CR) 10Dzahn: [C:03+1] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:55:55] (03CR) 10CI reject: [V:04-1] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:56:29] (03PS2) 10Jcrespo: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) [16:56:31] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:56:59] (03CR) 10CI reject: [V:04-1] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:57:04] 
(03CR) 10Dzahn: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:57:22] (03PS3) 10Jcrespo: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) [16:57:30] (03CR) 10Dzahn: "confirmed gerrit2003 is still the "spare_host" as of today" [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:57:56] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:58:07] (03CR) 10CI reject: [V:04-1] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:58:09] (03CR) 10jenkins-bot: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:58:18] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1193832 (owner: 10Hashar) [16:58:29] 06SRE, 06Data-Engineering: Set up a working, usable dbt installation on stat boxes - https://phabricator.wikimedia.org/T406634#11256150 (10BTullis) [16:58:37] (03PS4) 10Dzahn: gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:59:20] (03CR) 10Dzahn: [C:03+1] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [16:59:34] (03PS11) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:00:01] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194697 
(https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [17:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T1700). [17:00:41] o/ [17:01:04] I'll get started in a few minutes [17:01:08] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:03:51] (03CR) 10Scott French: "Thanks for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:03:55] (03CR) 10Scott French: [C:03+2] mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:04:30] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1194697/7233/gerrit2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [17:04:45] (03CR) 10Jcrespo: [C:03+2] gerrit: Disable gerrit2003 backups [puppet] - 10https://gerrit.wikimedia.org/r/1194697 (https://phabricator.wikimedia.org/T406762) (owner: 10Jcrespo) [17:06:21] (03CR) 10Ssingh: [C:03+2] site.pp: switch hcaptcha1001 to role hcaptcha_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1194646 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [17:06:52] (03Merged) 10jenkins-bot: mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:06:56] (03PS12) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 
(https://phabricator.wikimedia.org/T397246) [17:08:11] (03PS7) 10Aaron Schulz: Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [17:09:13] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:09:22] (03PS1) 10Ssingh: hiera: hcaptcha1001: set realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/1194700 (https://phabricator.wikimedia.org/T405631) [17:09:23] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:09:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:09:33] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:09:35] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:09:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:09:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:09:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:10:06] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:10:19] (03CR) 10Ssingh: "Should have been in the previous commit but oh well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194700 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [17:10:36] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:10:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:10:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:10:50] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:10:51] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:10:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:10:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:10:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:11:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:12:18] (03CR) 10Scott French: [C:03+2] mw-*: Right-size large service after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194256 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:13:41] (03PS13) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:15:04] (03Merged) 10jenkins-bot: mw-*: Right-size large service after switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194256 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:15:16] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:17:15] 06SRE, 06Data-Engineering (Q1 FY25/26 July 1st - September 30th): Set up a working, usable dbt installation on stat boxes - 
https://phabricator.wikimedia.org/T406634#11256271 (10Ahoelzl) [17:18:10] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:18:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:19:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:20:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:20:41] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:22:32] (03PS14) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:22:54] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2052 gradually with 4 steps - Pooling in new host [17:23:53] (03CR) 10Dzahn: gerrit: mod_qos tweaks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [17:24:15] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:25:10] (03CR) 10Ladsgroup: "I haven't tested it but the logic and the order looks correct to me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [17:26:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:26:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:27:28] (03PS1) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 [17:28:07] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194702" [puppet] - 10https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [17:28:28] (03CR) 10Ssingh: [C:03+2] hiera: hcaptcha1001: set realserver::pools [puppet] - 10https://gerrit.wikimedia.org/r/1194700 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [17:31:30] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [17:31:37] !! 
[17:32:09] !log Enable unified mobile routing on en.wikipedia.org - T403510 [17:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:15] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [17:32:30] I'll be doing it slowly and checking, so it won't be immediate [17:32:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:33:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:33:29] (03PS2) 10Esanders: Revert "Invalidate Flow cache on enwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 [17:34:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders) [17:34:42] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1001.wikimedia.org with OS bookworm [17:39:13] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:39:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:41:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm [17:42:46] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:42:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:43:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:43:38] !log 
swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:44:10] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:44:22] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:44:53] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:44:54] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: reduce code duplication for new zuul setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193958 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:45:06] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:45:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11256390 (10RobH) [17:45:50] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1194306/7235/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1194306 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:46:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11256391 (10RobH) [17:46:07] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: create class and systemd unit for new zuul-web service [puppet] - 10https://gerrit.wikimedia.org/r/1194306 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:47:24] (03CR) 10Btullis: [C:03+2] Disable istio-injection for the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194673 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [17:47:45] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [17:49:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:49:26] !log swfrench@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/mw-api-ext: apply [17:49:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:50:03] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:50:34] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:50:42] (03CR) 10Dzahn: [C:03+1] gerrit: Switchover gerrit1003 → gerrit2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188351 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [17:50:44] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:51:12] (03CR) 10Aaron Schulz: Route old /api/rest_v1/?specs endpoints to static JSON files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [17:51:15] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:51:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:51:29] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:54:31] !log completed post-switchover right-sizing of large mediawiki services - T405955 [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:35] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:54:59] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [17:55:18] (03CR) 10Btullis: [C:03+1] "I can't see why the PCC build failed, but +1 in principle." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [17:56:53] (03Merged) 10jenkins-bot: Disable istio-injection for the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194673 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [17:58:29] (03CR) 10Dzahn: "the compiler fails because host " wdqs2017.codfw.wmnet" was skipped (fail fast). this is most likely because the host is new and the compi" [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [17:59:25] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [18:00:30] brett: 🎉 [18:00:42] \o/ [18:01:13] (03CR) 10Dzahn: "https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [18:04:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [18:07:14] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:08:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:08:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:15:28] (03PS1) 10Dzahn: zuul: run zuul-web services as zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1194710 (https://phabricator.wikimedia.org/T405119) [18:16:18] (03PS2) 10Dzahn: zuul: run zuul-web services as zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1194710 (https://phabricator.wikimedia.org/T405119) [18:16:58] (03CR) 10Dzahn: [C:03+2] zuul: run zuul-web services as zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1194710 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [18:17:01] (03PS1) 10Subramanya Sastry: Revert "Add a DOM version of the TOC markers pass" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194712 [18:22:30] (03PS1) 10Cathal Mooney: Release v0.11.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1194713 [18:23:03] is the window open for a backport? ... ^^ there. [18:23:09] (03CR) 10Bking: [C:03+2] "Thanks @dzahn@wikimedia.org! Merging..." [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [18:24:42] (03CR) 10Dzahn: "it's ok to merge multiple and my patch as well" [puppet] - 10https://gerrit.wikimedia.org/r/1194696 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [18:25:06] (03PS3) 10Clément Goubert: Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [18:25:14] jouncebot: now [18:25:15] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [18:25:16] jouncebot: nowandnext [18:25:17] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [18:25:17] In 1 hour(s) and 34 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T2000) [18:25:22] (03PS1) 10Ssingh: O:hcaptcha_proxy: include profile::nginx [puppet] - 
10https://gerrit.wikimedia.org/r/1194714 (https://phabricator.wikimedia.org/T405631) [18:25:28] subbu: there is a bot to query [[Deployments]] :] [18:25:49] (03CR) 10C. Scott Ananian: [C:03+1] Revert "Add a DOM version of the TOC markers pass" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194712 (owner: 10Subramanya Sastry) [18:26:03] ah, right .. i forgot. [18:26:04] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7237/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194714 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [18:26:25] ok .. i can try spiderpig for the very first time ... anyone around to rescue me if I mess something up? [18:26:31] well unless you are a frequent deployer, I guess it is hard to remember about everything [18:26:43] go for it, I am around :] [18:26:44] (03CR) 10Cathal Mooney: [C:03+2] Release v0.11.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1194713 (owner: 10Cathal Mooney) [18:26:46] ok, ty. [18:27:01] (03CR) 10Aaron Schulz: [C:03+1] "Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [18:27:12] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on wdqs2017.codfw.wmnet with reason: finish getting host ready for production [18:27:30] (03CR) 10Hashar: [C:03+1] Revert "Add a DOM version of the TOC markers pass" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194712 (owner: 10Subramanya Sastry) [18:27:42] (03CR) 10BCornwall: [C:03+1] O:hcaptcha_proxy: include profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/1194714 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [18:28:10] 10ops-codfw, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: wdqs2017: Apparent hardware issue, rack C2 - https://phabricator.wikimedia.org/T406609#11256510 (10bking) 05In progress→03Resolved The host reimaged successfully. Closing... [18:28:16] (03CR) 10Ssingh: [V:03+1 C:03+2] O:hcaptcha_proxy: include profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/1194714 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [18:29:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ssastry@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194712 (owner: 10Subramanya Sastry) [18:29:51] okay .. i kicked it off in spiderpig ... 
^^ [18:30:20] the awesome thing is that I can watch the job progress/logs from my browser :] [18:30:44] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release vX.Y.Z - cmooney@cumin1003 [18:32:11] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1001.wikimedia.org with OS bookworm [18:32:35] (03CR) 10Aaron Schulz: [C:03+1] "I'd also like to get this one out soon :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [18:33:12] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release vX.Y.Z - cmooney@cumin1003 [18:34:14] (03PS1) 10Ssingh: site.pp: reimage all hcaptcha nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) [18:34:22] subbu: have you seen the reverted change was part of a chain? There is a child change https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1193498/ that might require it. [18:34:53] 06SRE, 07CSS, 13Patch-For-Review: Update the errorpage template to use flex - https://phabricator.wikimedia.org/T392692#11256566 (10Pppery) [18:34:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1001.wikimedia.org with OS bookworm [18:35:35] hmm .. let me check. [18:36:13] (I have also posted a quick summary on https://phabricator.wikimedia.org/T406749#11256575 ) [18:36:24] !log Enable unified mobile routing on en.wikipedia.org rollout complete - T403510 [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:28] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [18:36:44] 🎉🎉🎉 [18:37:05] nah, i think it is fine. they are unrelated changes. [18:37:16] great! 
[18:39:03] brett: subbu: did you get rid of the .m. subdomains? [18:39:51] (03Merged) 10jenkins-bot: Revert "Add a DOM version of the TOC markers pass" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194712 (owner: 10Subramanya Sastry) [18:39:59] (03PS1) 10Scott French: mw-(api-ext|web): Scale next releases to 10% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194716 (https://phabricator.wikimedia.org/T405955) [18:40:07] (03PS1) 10Cathal Mooney: Homer: update our base config file to include 'selectors' for nokia [puppet] - 10https://gerrit.wikimedia.org/r/1194717 (https://phabricator.wikimedia.org/T402511) [18:40:08] hashar, i don't understand ... [18:40:10] hashar: They should be bye [18:40:13] bye-bye [18:40:13] (03PS1) 10Scott French: Enroll 1% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194718 (https://phabricator.wikimedia.org/T405955) [18:40:28] !log ssastry@deploy2002 Started scap sync-world: Backport for [[gerrit:1194712|Revert "Add a DOM version of the TOC markers pass"]] [18:40:33] subbu: sorry I wanted to ping sukhe :/ [18:40:33] (oh probably a message not for me, but b.rett) [18:40:44] apparently I can only read the first two letters of each nickname [18:40:55] hashar: I had nothing to do with it, but yes, it's going away :) that's Krinkle and brett [18:41:07] rather, they, since all are going away but enwiki today, so it [18:43:02] (03CR) 10Ssingh: [C:03+1] "Without knowing much and going by the commit message, looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1194717 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [18:43:05] !log For posterity: October 8th 2025. The day brett and Krinkle are getting rid of the last .m. subdomain. 
[18:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] brett: that is quite an achievement, kudos to both of you :] [18:43:29] thank you :) [18:43:34] hashar: mobile devices now get the mobile version directly on the standard domain. The m-dot still works. We might start redirecting m-dot next week for some wikis but it'll always work for compat. [18:43:35] (03PS1) 10Dzahn: zuul: add port mapping for port 9000 for zuul-web service [puppet] - 10https://gerrit.wikimedia.org/r/1194719 (https://phabricator.wikimedia.org/T405119) [18:43:38] https://www.mediawiki.org/wiki/Requests_for_comment/Mobile_domain_sunsetting/2025_Announcement [18:43:42] (03CR) 10Cathal Mooney: [C:03+2] Homer: update our base config file to include 'selectors' for nokia [puppet] - 10https://gerrit.wikimedia.org/r/1194717 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [18:43:51] 🎆 [18:45:21] hmm [18:45:42] the k8s deployment looks stall at left: 12 [18:46:02] but if I head on K8S-debug and check https://nl.wiktionary.org/wiki/proscriptie?useskin=vector , it is shown as fixed already [18:46:17] ah progress [18:46:23] !log ssastry@deploy2002 ssastry: Backport for [[gerrit:1194712|Revert "Add a DOM version of the TOC markers pass"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:46:32] not sure why it spends 2:40 waiting to jump to left: 3 [18:46:45] I guess the progress is not linear [18:47:43] there is something funky with the cache maybe. 
Cause without mwdebug I got the proper version [18:47:52] I refreshed using &debug=1 and I got the wrong version again [18:47:58] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [18:48:24] and now targeting https://nl.wiktionary.org/wiki/proscriptie?useskin=vector&debug=1 with k8s-debug, I got the faulty version with the escaped [18:48:43] (03CR) 10Dzahn: [C:03+2] zuul: add port mapping for port 9000 for zuul-web service [puppet] - 10https://gerrit.wikimedia.org/r/1194719 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [18:48:52] ok .. time to verify [18:49:58] I don't think I can help much on that front :/ [18:50:05] I am afk a bit [18:50:14] !log ssastry@deploy2002 ssastry: Continuing with sync [18:50:22] i verified, all good to go. [18:51:39] bummer ... we forgot to tag the revert with the phab task ... anyway, will update it after it is deployed. [18:52:56] !incidents [18:52:56] 6840 (RESOLVED) Host cr1-esams [18:52:56] 6842 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [18:52:56] 6841 (RESOLVED) ProbeDown sre (2a02:ec80:300:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 esams) [18:53:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194650 (https://phabricator.wikimedia.org/T406655) (owner: 10Anzx) [18:53:17] (03CR) 10Bking: [C:03+1] "approved, thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) (owner: 10Ssingh) [18:53:32] (03CR) 10Ssingh: [C:03+2] team-sre/cdn: ignore (wdqs-main|wdqs-scholarly|wcqs).discovery.wmnet in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) (owner: 10Ssingh) [18:54:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [18:54:52] sukhe: no worries. I posted a quick summary on the blocker task with links to the related Gerrit change (master + the two reverts) [18:55:25] hashar: I think you meant subbu :P [18:55:36] oh f** g** s*** [18:55:37] again [18:55:43] I apologize [18:56:11] * hashar renames everyone [18:56:29] !log ssastry@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194712|Revert "Add a DOM version of the TOC markers pass"]] (duration: 16m 00s) [18:58:22] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm [18:59:16] subbu: I still see them broken on https://nl.wiktionary.org/wiki/proscriptie?useskin=vector . I Guess some side effect with some caching? 
:/ [18:59:55] (03PS15) 10Bking: opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [19:00:06] gerrit seems to be struggling [19:01:20] tgr_: Daimona_ reported a couple hours ago that he had troubles connecting and was logged out constantly [19:01:20] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secret.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:01:26] (03PS1) 10Ssingh: conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) [19:02:57] I didn't get logged out but I get ERR_CONNECTION_RESET most of the time [19:03:42] * hashar files a task AGAIN [19:04:52] hashar, yes .. caching probably. [19:06:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:06:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:59] tgr_: https://phabricator.wikimedia.org/T406774 [19:08:21] thx [19:10:38] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1001.wikimedia.org with OS bookworm [19:11:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:15:31] (03PS1) 10Hashar: gerrit: 
disable mod_qos: make it log only [puppet] - 10https://gerrit.wikimedia.org/r/1194723 (https://phabricator.wikimedia.org/T406774) [19:16:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:18:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194562 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:18:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194563 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:19:49] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on remaining Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194562 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:19:51] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194563 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:20:26] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1194562|Disable wmgUseMdotRouting on remaining Wikipedias except enwiki (T403510)]], [[gerrit:1194563|Disable wmgUseMdotRouting on enwiki (T403510)]] [19:20:29] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [19:24:40] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1194562|Disable wmgUseMdotRouting on remaining Wikipedias except enwiki (T403510)]], [[gerrit:1194563|Disable wmgUseMdotRouting on enwiki (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:25:30] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [19:25:48] !log krinkle@deploy2002 krinkle: Continuing with sync [19:29:12] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:29:52] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194562|Disable wmgUseMdotRouting on remaining Wikipedias except enwiki (T403510)]], [[gerrit:1194563|Disable wmgUseMdotRouting on enwiki (T403510)]] (duration: 09m 26s) [19:30:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422) (owner: 10DLynch) [19:37:15] (03CR) 10JHathaway: [C:03+1] sre.hardware.upgrade-firmware: fix ssd upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:53:58] (03CR) 10Scott French: [C:03+2] gerrit: disable mod_qos: make it log only [puppet] - 10https://gerrit.wikimedia.org/r/1194723 (https://phabricator.wikimedia.org/T406774) (owner: 10Hashar) [19:58:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11256799 (10CDanis) BTW, after looking at a few weeks of data, I suggest increasing the failure sampling fraction for these services.... 
[19:59:45] Info: /Stage[main]/Profile::Gerrit::Proxy/Httpd::Conf[qos]/File[/etc/apache2/conf-available/50-qos.conf]: Scheduling refresh of Service[apache2] [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T2000). [20:00:05] tgr, anzx, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] o/ [20:00:51] o/ [20:01:40] I can deploy [20:02:07] I can also deploy. [20:02:27] !log Disabled Gerrit Apache mod_qos by putting it to be logging only # T406774 [20:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:31] T406774: Gerrit connection troubles and ERR_CONNECTION_RESET - https://phabricator.wikimedia.org/T406774 [20:03:10] can probably just do the three patches together? [20:03:15] none of them seem too exciting [20:03:35] I'm fine with it. 
[20:04:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:04:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194650 (https://phabricator.wikimedia.org/T406655) (owner: 10Anzx) [20:04:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422) (owner: 10DLynch) [20:04:41] (03CR) 10CI reject: [V:04-1] Deploy JWT session cookies to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:05:12] oh well [20:05:35] (03Merged) 10jenkins-bot: eswiki, commonswiki: lift IP cap for workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194650 (https://phabricator.wikimedia.org/T406655) (owner: 10Anzx) [20:05:38] (03Merged) 10jenkins-bot: Launch VisualEditor EditCheck paste check a/b test to 22 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422) (owner: 10DLynch) [20:05:43] (03CR) 10TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194334 (https://phabricator.wikimedia.org/T405422) (owner: 10DLynch) [20:06:12] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1194650|eswiki, commonswiki: lift IP cap for workshop (T406655)]], [[gerrit:1194334|Launch VisualEditor EditCheck paste check a/b test to 22 wikis (T405422)]] [20:06:18] T406655: Lift IP cap on these dates 2025-10-20 and 2025-10-21 for edit-a-thon for eswiki and commons - https://phabricator.wikimedia.org/T406655 [20:06:18] T405422: Deploy config change to start the Paste Check A/B Test - 
https://phabricator.wikimedia.org/T405422 [20:07:26] (03PS3) 10Gergő Tisza: Deploy JWT session cookies to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) [20:07:58] That really doesn't look like it should have experienced a merge conflict. [20:08:41] yeah git had no problem auto-merging it [20:08:58] gerrit can be weird about that kind of thing [20:11:18] !log tgr@deploy2002 tgr, kemayo, anzx: Backport for [[gerrit:1194650|eswiki, commonswiki: lift IP cap for workshop (T406655)]], [[gerrit:1194334|Launch VisualEditor EditCheck paste check a/b test to 22 wikis (T405422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:11:24] T406655: Lift IP cap on these dates 2025-10-20 and 2025-10-21 for edit-a-thon for eswiki and commons - https://phabricator.wikimedia.org/T406655 [20:11:25] T405422: Deploy config change to start the Paste Check A/B Test - https://phabricator.wikimedia.org/T405422 [20:11:32] do you want to test it? [20:12:43] (03Abandoned) 10Jforrester: mwdebug: Change various uses to mw-on-k8s version [puppet] - 10https://gerrit.wikimedia.org/r/1051344 (owner: 10Jforrester) [20:12:46] (03Abandoned) 10Jforrester: mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [20:14:46] tgr_: I just have, and it looks good. 
[20:14:50] (03PS1) 10Bearloga: EventStreamConfig: fix IP auto reveal stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 [20:15:03] !log tgr@deploy2002 tgr, kemayo, anzx: Continuing with sync [20:19:16] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194650|eswiki, commonswiki: lift IP cap for workshop (T406655)]], [[gerrit:1194334|Launch VisualEditor EditCheck paste check a/b test to 22 wikis (T405422)]] (duration: 13m 03s) [20:19:21] T406655: Lift IP cap on these dates 2025-10-20 and 2025-10-21 for edit-a-thon for eswiki and commons - https://phabricator.wikimedia.org/T406655 [20:19:21] T405422: Deploy config change to start the Paste Check A/B Test - https://phabricator.wikimedia.org/T405422 [20:19:52] (03CR) 10TrainBranchBot: "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:20:44] (03Merged) 10jenkins-bot: Deploy JWT session cookies to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194622 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [20:20:44] (03PS1) 10JHathaway: sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 [20:21:17] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1194622|Deploy JWT session cookies to group2 (T399631)]] [20:21:21] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:21:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [20:21:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11256897 (10hashar) @CDanis & @Dzahn thank you very much for adding the NEL. That has proven helpful to investigate an issue we had... 
[20:24:56] (03PS2) 10JHathaway: sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 [20:25:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [20:26:43] !log tgr@deploy2002 tgr: Backport for [[gerrit:1194622|Deploy JWT session cookies to group2 (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:26:46] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:27:57] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11256913 (10CDobbins) Thanks for the clarification. That sounds good, @MoritzMuehlenhoff. I have no objections to implementing this. [20:28:43] (03PS1) 10DCausse: flink-operator: align mem settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194735 (https://phabricator.wikimedia.org/T405361) [20:30:06] (03CR) 10RLazarus: [C:03+2] mesh: Copy configuration_1.14.1 to 1.14.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191721 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:31:04] !log tgr@deploy2002 tgr: Continuing with sync [20:32:12] (03PS16) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [20:32:43] (03PS1) 10Cwhite: alertmanager: add sec team route to slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1194736 [20:32:57] (03Merged) 10jenkins-bot: mesh: Copy configuration_1.14.1 to 1.14.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191721 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:33:29] (03PS17) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 
(https://phabricator.wikimedia.org/T397246) [20:34:57] (03CR) 10RLazarus: [C:03+2] "Thanks both!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:35:06] (03CR) 10CI reject: [V:04-1] mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:35:11] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194622|Deploy JWT session cookies to group2 (T399631)]] (duration: 13m 53s) [20:35:14] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [20:35:21] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:35:36] (03PS3) 10RLazarus: mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) [20:35:45] (03CR) 10Cwhite: alertmanager: add sec team route to slack channel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194736 (owner: 10Cwhite) [20:36:34] !log UTC late deploys done [20:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:36] (03CR) 10RLazarus: mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:37:40] (03CR) 10RLazarus: [C:03+2] mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:39:11] (03PS3) 10JHathaway: sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 [20:39:15] (03CR) 10JHathaway: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [20:39:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:00] (03PS10) 10Jforrester: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [20:40:15] (03Merged) 10jenkins-bot: mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [20:40:27] (03CR) 10Jforrester: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [20:41:38] (03CR) 10Hashar: "Thanks, I wrote the summary T406774#11256868" [puppet] - 10https://gerrit.wikimedia.org/r/1194723 (https://phabricator.wikimedia.org/T406774) (owner: 10Hashar) [20:42:54] (03CR) 10Ladsgroup: "Some config seems to be changed: https://integration.wikimedia.org/ci/job/operations-mw-config-php81-composer-diffConfig/2183/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [20:44:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate 
default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:53:45] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:55:56] (03PS18) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [20:57:41] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:58:02] marostegui@cumin1003 clone_es (PID 1785807) is awaiting input [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T2100) [21:07:44] (03CR) 10Ryan Kemper: [C:03+2] wdqs: move wdqs1018 back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1194341 (owner: 10Ryan Kemper) [21:08:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11257052 (10BCornwall) @RobH Thanks for soliciting the feedback! The cp hosts depooling schedule is fine. For DNS, we would prefer to depool these as well rather than unplug them live.... 
[21:10:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [21:13:10] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh internal-scholarly host T405978 [21:13:14] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [21:13:23] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh internal-scholarly host T405978 (duration: 00m 12s) [21:18:54] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer scholarly_articles from wdqs2016.codfw.wmnet -> wdqs2017.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [21:18:59] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [21:19:22] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T405978, transfer to freshly reimaged host) xfer scholarly_articles from wdqs2016.codfw.wmnet -> wdqs2017.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [21:19:28] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer scholarly_articles from wdqs2016.codfw.wmnet -> wdqs2017.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [21:21:35] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11257102 (10MoritzMuehlenhoff) >>! In T389333#11256913, @CDobbins wrote: > Thanks for the clarification. That sounds good, @MoritzMuehlenhoff. I have no objections to implementi... 
[21:25:08] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1019.eqiad.wmnet with OS bullseye [21:32:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:58] (03PS19) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:43:19] (03CR) 10Santiago Faci: [C:03+1] Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [21:49:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1020:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:50:23] (03PS1) 10RLazarus: all charts: Update mesh.configuration 1.14.1 to 1.14.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194742 (https://phabricator.wikimedia.org/T404036) [21:50:25] (03PS1) 10RLazarus: kartotherian, tegola-vector-tiles: Remove unused tcp_health_check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194743 (https://phabricator.wikimedia.org/T404036) [21:51:44] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:53:32] (03PS1) 10Daimona Eaytoy: Change CampaignEvents user rights for all small and medium wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) [21:53:36] (03CR) 10BCornwall: [C:03+1] site.pp: reimage all hcaptcha nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [21:53:45] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:54:04] (03CR) 10BCornwall: [C:03+1] conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251008T2200) [22:00:26] (03CR) 10BPirkle: [C:03+1] Route old /api/rest_v1/?specs endpoints to static JSON files (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [22:01:52] ryankemper@cumin2002 reimage (PID 1448924) is awaiting input [22:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:09:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer to freshly reimaged host) xfer scholarly_articles from wdqs2016.codfw.wmnet -> wdqs2017.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [22:09:58] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [22:11:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:12:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:13:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:13:28] (03CR) 10Daimona Eaytoy: "Used some regex-matching to confirm that:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [22:13:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [22:16:43] ryankemper@cumin2002 reimage (PID 1455130) is awaiting input [22:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:19:49] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1020:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:22:39] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [22:22:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:23:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:24:36] (03PS20) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [22:26:47] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:26:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:27:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:30:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:41:17] (03CR) 10Jforrester: Avoid using wikitech dblist in configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [22:46:18] (03CR) 10Dr0ptp4kt: profile::thanos: fix xlab SLI's recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [22:51:28] marostegui@cumin1003 clone_es (PID 1790296) is awaiting input [22:54:10] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11257321 (10Ahoelzl) Dear SRE, I'd appreciate if you could expedite this ticket. Thank you! [22:55:36] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11257324 (10Ahoelzl) Dear SRE, I'd appreciate if you could expedite this ticket. Thank you! 
[23:00:31] (03PS2) 10Dr0ptp4kt: profile::thanos: fix xlab SLI's recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1193437 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [23:04:11] (03PS1) 10Jforrester: tests: Remove usage of ReflectionProperty::setAccessible(), no-op [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194749 (https://phabricator.wikimedia.org/T406744) [23:05:20] (03PS18) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [23:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:07:36] (03CR) 10Scott French: [C:03+1] "Thanks for the solid commit message!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194742 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:09:06] (03CR) 10RLazarus: [C:03+2] all charts: Update mesh.configuration 1.14.1 to 1.14.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194742 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:10:00] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [23:10:57] (03PS19) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [23:11:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[23:20:09] (03Merged) 10jenkins-bot: all charts: Update mesh.configuration 1.14.1 to 1.14.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194742 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:28:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [23:29:54] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:34] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795 (10RobH) 03NEW [23:31:42] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11257437 (10RobH) a:03colewhite @colewhite, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-opera... 
[23:32:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796 (10RobH) 03NEW [23:33:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11257458 (10RobH) a:03colewhite @colewhite, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-opera... [23:33:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd100[567] - https://phabricator.wikimedia.org/T406796#11257463 (10RobH) [23:33:47] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q2:rack/setup/install logging-sd200[567] - https://phabricator.wikimedia.org/T406795#11257465 (10RobH) [23:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1194782 [23:38:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1194782 (owner: 10TrainBranchBot) [23:42:45] jouncebot: nowandnext [23:42:45] No deployments scheduled for the next 6 hour(s) and 17 minute(s) [23:42:45] In 6 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600) [23:42:46] In 6 hour(s) and 17 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600) [23:45:04] (03PS1) 10RLazarus: mesh: Copy configuration_1.14.2 to 1.14.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194783 (https://phabricator.wikimedia.org/T404036) [23:45:06] (03PS1) 10RLazarus: mesh.configuration: Fix a typo in the OTel service_name template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194784 (https://phabricator.wikimedia.org/T404036) [23:45:13] FYI, please hold for a few minutes if you're 
considering deploying a service in wikikube with helmfile. we're making a couple of changes to envoy configs that require a wee bit of coordination. thanks! [23:49:38] (03PS1) 10Ryan Kemper: wdqs: bring wdqs101[8-9] into svc [puppet] - 10https://gerrit.wikimedia.org/r/1194785 (https://phabricator.wikimedia.org/T405978) [23:50:06] (03CR) 10Scott French: [C:03+1] mesh.configuration: Fix a typo in the OTel service_name template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194784 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:50:26] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage [23:50:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1019.eqiad.wmnet with reason: host reimage [23:50:47] (03CR) 10Ryan Kemper: [C:03+1] wdqs: bring wdqs101[8-9] into svc [puppet] - 10https://gerrit.wikimedia.org/r/1194785 (https://phabricator.wikimedia.org/T405978) (owner: 10Ryan Kemper) [23:50:51] (03CR) 10Ryan Kemper: [C:03+2] wdqs: bring wdqs101[8-9] into svc [puppet] - 10https://gerrit.wikimedia.org/r/1194785 (https://phabricator.wikimedia.org/T405978) (owner: 10Ryan Kemper) [23:51:03] (03CR) 10RLazarus: [C:03+2] mesh: Copy configuration_1.14.2 to 1.14.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194783 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:53:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1194782 (owner: 10TrainBranchBot) [23:54:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1018.eqiad.wmnet with reason: host reimage [23:55:10] (03Merged) 10jenkins-bot: mesh: Copy configuration_1.14.2 to 1.14.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194783 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:58:29] (03CR) 10RLazarus: 
[C:03+2] mesh.configuration: Fix a typo in the OTel service_name template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194784 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [23:58:33] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1019.eqiad.wmnet with reason: host reimage