[00:00:27] (Merged) jenkins-bot: mesh.configuration: Fix a typo in the OTel service_name template [deployment-charts] - https://gerrit.wikimedia.org/r/1194784 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:00:30] (PS1) RLazarus: all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036)
[00:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787
[00:08:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787 (owner: TrainBranchBot)
[00:08:53] PROBLEM - dump of s3 in codfw on backupmon1001 is CRITICAL: dump for s3 at codfw (db2239) taken more than a week ago: Most recent backup 2025-09-30 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:14:01] SRE, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Set up a working, usable dbt installation on stat boxes - https://phabricator.wikimedia.org/T406634#11257524 (Ahoelzl)
[00:16:51] (CR) CI reject: [V:-1] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:19:08] (CR) RLazarus: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:30:27] PROBLEM - dump of s1 in codfw on backupmon1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than a week ago: Most recent backup 2025-09-30 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:45] (CR) Scott French: [C:+1] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner:
RLazarus)
[00:34:21] (CR) RLazarus: [C:+2] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:44:02] (Merged) jenkins-bot: all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:48:48] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787 (owner: TrainBranchBot)
[00:48:55] helmfile deployments are all clear again 👍
[01:00:50] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:04:56] ops-eqiad, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799 (phaultfinder) NEW
[01:15:10] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 20s)
[01:19:57] ops-eqiad, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11257695 (phaultfinder)
[01:36:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:51:44] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[01:54:12] FIRING: [6x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:54:20] !log [wdqs1020:~] $ sudo systemctl restart wdqs-blazegraph
[01:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:23] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:48] FIRING: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:57:17] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:58:07] PROBLEM - dump of s4 in codfw on backupmon1001 is CRITICAL: dump for s4 at codfw (db2239) taken more than a week ago: Most recent backup 2025-09-30 01:36:23 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:58:47] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:59:12] FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1018:9100 -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:01:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:02:17] FIRING: [4x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:02:26] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20251008
[02:04:58] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11257749 (phaultfinder)
[02:06:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:11:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:11:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release 20251008
[02:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state
pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:27:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:51:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:59:12] FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:04:12] FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:19:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:24:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:46:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:51:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:00:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:01:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:09:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:16:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1030 gradually with 4 steps - Pool es1030.eqiad.wmnet in after cloning
[05:16:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:11] marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[05:21:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:28:46] marostegui@cumin1003 clone_es (PID 1785807) is
awaiting input
[05:34:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:52] marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[05:36:14] (PS1) Marostegui: db2155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194806 (https://phabricator.wikimedia.org/T406541)
[05:36:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning
[05:37:03] (CR) Marostegui: [C:+2] db2155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194806 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[05:37:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance
[05:37:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2155 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83686 and previous config saved to /var/cache/conftool/dbconfig/20251009-053730-marostegui.json
[05:40:08] (PS1) Marostegui: instances.yaml: Add es1050 and es1053 [puppet] - https://gerrit.wikimedia.org/r/1194808 (https://phabricator.wikimedia.org/T406488)
[05:41:04] (CR) Marostegui: [C:+2] instances.yaml: Add es1050 and es1053 [puppet] - https://gerrit.wikimedia.org/r/1194808 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[05:41:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:47] !log marostegui@cumin1003 dbctl commit
(dc=all): 'Add es1050 and es1053 depooled T406488', diff saved to https://phabricator.wikimedia.org/P83687 and previous config saved to /var/cache/conftool/dbconfig/20251009-054347-marostegui.json
[05:43:51] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:45:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83688 and previous config saved to /var/cache/conftool/dbconfig/20251009-054548-root.json
[05:46:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:47:12] (PS1) Marostegui: installserver: Remove es1051 [puppet] - https://gerrit.wikimedia.org/r/1194812 (https://phabricator.wikimedia.org/T406488)
[05:51:17] (CR) Marostegui: [C:+2] installserver: Remove es1051 [puppet] - https://gerrit.wikimedia.org/r/1194812 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[05:51:29] FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:52:49] (PS1) Arnaudb: gerrit: mod_qos revert to previous stable state [puppet] - https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403)
[05:59:48] FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600)
[06:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again!
You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600).
[06:00:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83691 and previous config saved to /var/cache/conftool/dbconfig/20251009-060054-root.json
[06:01:29] FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:01:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1030 gradually with 4 steps - Pool es1030.eqiad.wmnet in after cloning
[06:01:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1030.eqiad.wmnet onto es1053.eqiad.wmnet
[06:02:32] FIRING: [4x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:09:54] (CR) Arnaudb: [C:+2] gerrit: mod_qos tweaks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: Arnaudb)
[06:11:29] FIRING: [2x]
SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:15:33] (CR) Arnaudb: gerrit: increase QS_ClientPrefer threshold (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1194702 (owner: Dzahn)
[06:16:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83694 and previous config saved to /var/cache/conftool/dbconfig/20251009-061600-root.json
[06:22:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning
[06:22:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1027.eqiad.wmnet onto es1050.eqiad.wmnet
[06:26:19] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host
[06:26:33] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host (duration: 00m 14s)
[06:26:43] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host
[06:26:56] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host (duration: 00m 13s)
[06:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:27:23] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:23] RECOVERY -
Blazegraph Port for wdqs-categories on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:27] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[06:27:35] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:47] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:28:36] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1020.eqiad.wmnet -> wdqs1019.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:28:49] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:28:58] !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1019.*
[06:31:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83696 and previous config saved to /var/cache/conftool/dbconfig/20251009-063106-root.json
[06:36:15] (CR) Muehlenhoff: [C:+1] site.pp: reimage all hcaptcha nodes to role [puppet] - https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) (owner: Ssingh)
[06:36:57] (CR) Muehlenhoff: [C:+1] conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) (owner: Ssingh)
[06:39:24] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host
gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[06:48:34] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[06:50:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1019.eqiad.wmnet with OS bullseye
[06:53:01] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[06:58:21] (CR) Muehlenhoff: [C:+1] "Looks good to me. The new FastFloat lib isn't marked as PIC, whole folly uses it which itself does use PIC? But if the build resuilt works" [debs/mcrouter] - https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) (owner: Andrew Bogott)
[07:00:04] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0700).
[07:00:04] lmora: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:07] (Abandoned) Slyngshede: P:cache::haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1184497 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:00:58] I'm adding a patch to the window
[07:01:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: Kosta Harlan)
[07:01:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: Kosta Harlan)
[07:02:42] (CR) Kosta Harlan: Add ReadingList Stream to EventStreamConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: LorenMora)
[07:03:31] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: Kosta Harlan)
[07:04:40] (PS13) Slyngshede: P:cache::haproxy copy private repo data [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[07:05:54] !log installing Redis security updates
[07:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:52] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:14:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1034 and es1029 T406488', diff saved to
https://phabricator.wikimedia.org/P83697 and previous config saved to /var/cache/conftool/dbconfig/20251009-071430-marostegui.json
[07:14:35] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:15:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1029,1034].eqiad.wmnet with reason: Cloning
[07:16:09] (Merged) jenkins-bot: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: Kosta Harlan)
[07:17:17] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]]
[07:17:19] (PS1) Marostegui: mariadb: Productionize es1052 [puppet] - https://gerrit.wikimedia.org/r/1194822 (https://phabricator.wikimedia.org/T406488)
[07:17:20] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204
[07:18:22] (PS14) Slyngshede: P:cache::haproxy copy private repo data [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[07:18:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1194733 (owner: Bearloga)
[07:19:57] (CR) Marostegui: [C:+2] mariadb: Productionize es1052 [puppet] - https://gerrit.wikimedia.org/r/1194822 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[07:20:10] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both
afterwards [07:20:14] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978 [07:20:31] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:20:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1020.eqiad.wmnet -> wdqs1019.eqiad.wmnet w/ force delete existing files, repooling both afterwards [07:21:29] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:22:12] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:24:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1029.eqiad.wmnet onto es1052.eqiad.wmnet [07:24:12] FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:17] RESOLVED: [8x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:25:06] !log kharlan@deploy2002 kharlan: Continuing with sync [07:25:36] (03PS1) 10Marostegui: mariadb: Productionize es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1194823 (https://phabricator.wikimedia.org/T406488) [07:27:11] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1194823 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui) [07:29:11] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] (duration: 11m 54s) [07:29:15] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [07:31:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1034.eqiad.wmnet onto es1057.eqiad.wmnet [07:32:55] On to the next ones [07:33:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: 10Kosta Harlan) [07:33:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 (owner: 10Bearloga) [07:34:39] (03CR) 
10Thiemo Kreuz (WMDE): [C:03+2] tests: Remove usage of ReflectionProperty::setAccessible(), no-op [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194749 (https://phabricator.wikimedia.org/T406744) (owner: 10Jforrester) [07:35:07] (03Merged) 10jenkins-bot: EventStreamConfig: Fix user-agent exclusion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: 10Kosta Harlan) [07:35:10] (03Merged) 10jenkins-bot: EventStreamConfig: fix IP auto reveal stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 (owner: 10Bearloga) [07:35:43] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]] [07:35:46] T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600 [07:36:02] (03Merged) 10jenkins-bot: tests: Remove usage of ReflectionProperty::setAccessible(), no-op [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194749 (https://phabricator.wikimedia.org/T406744) (owner: 10Jforrester) [07:39:01] (03PS1) 10Marostegui: db2147: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194824 (https://phabricator.wikimedia.org/T406541) [07:40:10] (03CR) 10Marostegui: [C:03+2] db2147: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194824 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [07:40:47] !log kharlan@deploy2002 kharlan, bearloga: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:40:50] T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600 [07:40:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance [07:40:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2147 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83698 and previous config saved to /var/cache/conftool/dbconfig/20251009-074055-marostegui.json [07:42:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1018.eqiad.wmnet with OS bullseye [07:43:40] !log kharlan@deploy2002 kharlan, bearloga: Continuing with sync [07:44:12] FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:47:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11258314 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [07:47:36] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]] (duration: 11m 53s) [07:47:40] T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600 [07:48:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [07:49:02] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. [07:49:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [07:49:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83699 and previous config saved to /var/cache/conftool/dbconfig/20251009-074914-root.json [07:49:43] (03Merged) 10jenkins-bot: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan) [07:51:29] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:51:32] I am seeing "unexpected commits pulled from origin for /srv/mediawiki-staging" [07:51:38] !log joal@deploy2002 Started deploy [analytics/refinery@af75327] (hadoop-test): Analytics deploy - druid pageviews_daily - TEST [analytics/refinery@af753272] [07:51:47] hashar: do you have some advice? [07:52:21] I have not even had my coffee yet! :D [07:52:30] I don't know what that means, is that in the SpiderPig output? 
[07:52:32] !log joal@deploy2002 Finished deploy [analytics/refinery@af75327] (hadoop-test): Analytics deploy - druid pageviews_daily - TEST [analytics/refinery@af753272] (duration: 00m 54s) [07:52:38] yeah [07:52:44] oh it looks to be from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194749 [07:52:45] so it's fine [07:53:00] proceeding [07:53:05] !log joal@deploy2002 Started deploy [analytics/refinery@af75327]: Analytics deploy - druid pageviews_daily [analytics/refinery@af753272] [07:53:17] hmm [07:53:19] !log kharlan@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. (scap version: 4.213.0) (duration: 00m 00s) [07:53:32] so yeah that change has not been deployed [07:53:34] `fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/release.git/': The requested URL returned error: 502` [07:53:37] hmm [07:53:51] then I thought scap was smart enough to detect a change was a noop (cause it only touches beta or tests) [07:54:12] I'm trying again, assuming the gitlab access issue was transient [07:54:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] [07:54:35] or maybe the skip only happens when one attempts to backport said patch (eg `scap backport 1194749`) [07:54:37] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [07:56:03] (03CR) 10Giuseppe Lavagetto: [C:04-1] "Overall LGTM: change the management of the lua private files directory as I suggested and you get my +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:56:58] !log joal@deploy2002 Finished deploy [analytics/refinery@af75327]: Analytics deploy - druid pageviews_daily [analytics/refinery@af753272] (duration: 03m 53s) [07:57:15] !log joal@deploy2002 Started deploy [analytics/refinery@af75327] (thin): Analytics deploy - druid pageviews_daily - THIN [analytics/refinery@af753272] [07:57:20] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [07:57:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:27] (03PS1) 10Slyngshede: P:idp prepare for new Trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455) [07:58:26] (03CR) 10Giuseppe Lavagetto: [C:04-1] "Sorry, re-thinking about it, I found a bigger issue: the routines called in the private repo can be quite expensive, so I'd only run that " [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:59:07] (03PS1) 10Muehlenhoff: Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565) [07:59:09] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:59:25] !log joal@deploy2002 Finished deploy [analytics/refinery@af75327] (thin): Analytics deploy - druid pageviews_daily - THIN [analytics/refinery@af753272] (duration: 02m 10s) [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0800) [08:00:46] morning, it looks like there's a backport still going on, so holding the train for now [08:00:49] (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [08:00:57] (03CR) 10Marostegui: "It looks good to me, do you want a host to test? I can give you one from s4 as I am currently doing that section" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [08:02:25] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:27] (03PS15) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:03:42] !log kharlan@deploy2002 kharlan: Continuing with sync [08:04:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83700 and previous config saved to /var/cache/conftool/dbconfig/20251009-080420-root.json [08:05:51] (03CR) 10Elukey: [C:03+1] Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:07:48] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover 
(T404204)]] (duration: 13m 14s) [08:07:51] T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204 [08:09:34] jnuche: good morning! It seems like kostajh backport has completed [08:09:46] I'm done [08:09:53] content transformer reverted the patch that inserted some `` in the table of content :-) [08:10:03] good morning, ack, I'll roll out the train in a few minutes [08:10:21] and Subbu was super happy doing it over SpiderPig [08:10:34] nice [08:11:39] (03PS16) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:12:33] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044'] [08:12:47] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2044'] [08:13:01] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:13:34] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678) [08:13:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [08:14:32] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot) [08:14:54] (03CR) 10Muehlenhoff: [C:03+2] Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:15:46] (03PS17) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 
(https://phabricator.wikimedia.org/T398161) [08:15:56] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:18:12] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2043.codfw.wmnet'] [08:18:36] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2043.codfw.wmnet'] [08:18:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:19:14] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [08:19:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83701 and previous config saved to /var/cache/conftool/dbconfig/20251009-081926-root.json [08:19:41] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet'] [08:19:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7238/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:19:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:20:08] (03CR) 10Slyngshede: [V:03+1] P:cache::haproxy copy private repo data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:20:13] (03CR) 10Slyngshede: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:22:58] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.22 refs T405678 [08:23:04] T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678 [08:26:19] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:26:57] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:27:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11258496 (10cmooney) >>! In T404959#11255300, @VRiley-WMF wrote: > Okay, was looking at this issue a bit. There are currently two fiber... 
[08:31:31] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876 [08:31:46] (03PS18) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:34:29] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:34:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83702 and previous config saved to /var/cache/conftool/dbconfig/20251009-083432-root.json [08:36:32] (03PS1) 10Marostegui: db2179: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194877 (https://phabricator.wikimedia.org/T406541) [08:37:04] (03CR) 10Marostegui: [C:03+2] db2179: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194877 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [08:37:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2179.codfw.wmnet with reason: Maintenance [08:38:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2179 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83703 and previous config saved to /var/cache/conftool/dbconfig/20251009-083801-marostegui.json [08:39:49] (03PS1) 10Elukey: admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) [08:41:25] (03PS2) 10Elukey: sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876 [08:41:35] (03Abandoned) 10Federico Ceratto: instances.yaml: Add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1184091 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) 
[08:42:22] 10SRE-swift-storage, 10Ceph, 06collaboration-services, 10Data-Persistence-Backup: Evaluate generic backup tooling for object storage buckets - https://phabricator.wikimedia.org/T406824 (10Jelto) 03NEW [08:42:42] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11258575 (10Jelto) [08:43:16] (03PS2) 10Elukey: admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) [08:44:29] (03CR) 10Marostegui: "I've been using this with a few hosts and works very nicely, what's pending to get it merged?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [08:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:44:59] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:46:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83704 and previous config saved to /var/cache/conftool/dbconfig/20251009-084614-root.json [08:48:12] (03PS1) 10Elukey: preseed: set ms-be2078 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356) [08:48:22] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11258584 (10Jelto) 05Open→03Resolved Great, then I'll resolve this task. 
I opened {T406824} as a follow up to track the obj... [08:48:38] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948) [08:48:40] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) [08:48:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:49:43] (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876 (owner: 10Elukey) [08:50:23] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) [08:50:26] (03PS5) 10Federico Ceratto: sanitize-wiki.py: Improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1191689 (https://phabricator.wikimedia.org/T366146) [08:51:29] (03CR) 10Marostegui: [C:03+1] sanitize-wiki.py: Improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1191689 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto) [08:52:15] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2050.codfw.wmnet'] [08:52:21] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet'] [08:52:34] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2050.codfw.wmnet'] [08:53:09] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [08:53:55] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for 
hosts ['cp2044.codfw.wmnet'] [08:54:49] (03CR) 10Federico Ceratto: "It's pending review if someone wants to take the time ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [09:01:03] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: use lower when matching firmware versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851) [09:01:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83705 and previous config saved to /var/cache/conftool/dbconfig/20251009-090120-root.json [09:03:36] (03PS1) 10Marostegui: db1252: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194884 (https://phabricator.wikimedia.org/T406541) [09:04:15] (03CR) 10Marostegui: [C:03+2] db1252: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194884 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [09:04:39] (03PS1) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) [09:05:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1252.eqiad.wmnet with reason: Maintenance [09:05:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1252 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83706 and previous config saved to /var/cache/conftool/dbconfig/20251009-090516-marostegui.json [09:06:17] (03CR) 10Elukey: "prerequisite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194639" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [09:10:50] (03PS2) 10Federico Ceratto: preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 
(https://phabricator.wikimedia.org/T402859) [09:10:50] (03PS2) 10Federico Ceratto: es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) [09:12:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:13:20] (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887 [09:13:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83707 and previous config saved to /var/cache/conftool/dbconfig/20251009-091322-root.json [09:13:30] (03CR) 10Marostegui: clone_es.py: clone readonly es* hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [09:14:16] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7239/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194887 (owner: 10Majavah) [09:15:13] (03CR) 10Klausman: [C:03+1] profile::amd_gpu: add initial support for the k8s node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [09:15:19] (03CR) 10Klausman: [C:03+1] admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [09:16:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83708 and previous config saved to /var/cache/conftool/dbconfig/20251009-091626-root.json [09:17:09] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851) 
(owner: 10Elukey) [09:17:12] jouncebot: nowandnext [09:17:12] For the next 0 hour(s) and 42 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0800) [09:17:12] In 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000) [09:17:24] (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: use lower when matching firmware versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [09:17:40] jnuche / hashar can I backport the patch for T406707 ? [09:17:40] T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707 [09:18:03] (03PS1) 10Kosta Harlan: Check against correct key in sortEntitiesByTimestamp [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707) [09:18:34] kostajh: yeah, fine by me [09:18:59] ok, I will start that then [09:19:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707) (owner: 10Kosta Harlan) [09:21:03] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet'] [09:21:19] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet'] [09:22:22] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:22:34] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948) (owner: 
10Majavah) [09:22:42] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:23:05] (03CR) 10Vgutierrez: [C:03+1] "nice job" [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur) [09:23:18] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:23:38] (03PS3) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) [09:23:38] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045.codfw.wmnet'] [09:23:56] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2045.codfw.wmnet'] [09:24:03] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet'] [09:24:03] (03PS2) 10Majavah: P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887 [09:25:57] (03PS1) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps replicas to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) [09:26:32] (03CR) 10CI reject: [V:04-1] Syncronise the Hiera settings for the bookworm maps replicas to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:26:51] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:27:17] 
(03PS2) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) [09:27:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356) (owner: 10Elukey) [09:28:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83709 and previous config saved to /var/cache/conftool/dbconfig/20251009-092827-root.json [09:28:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:29:33] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887 (owner: 10Majavah) [09:29:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:30:09] (03CR) 10Elukey: [C:03+2] preseed: set ms-be2078 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356) (owner: 10Elukey) [09:30:36] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2046.codfw.wmnet'] [09:31:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83711 and previous config saved to /var/cache/conftool/dbconfig/20251009-093131-root.json [09:32:35] (03Merged) 10jenkins-bot: Check against correct key in sortEntitiesByTimestamp [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707) (owner: 10Kosta Harlan) [09:32:52] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194890|Check against 
correct key in sortEntitiesByTimestamp (T406707)]] [09:32:56] T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707 [09:34:05] (03PS1) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 [09:35:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:35:46] (03PS2) 10Tiziano Fogli: Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:36:36] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:36:39] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194890|Check against correct key in sortEntitiesByTimestamp (T406707)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:37:10] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:37:26] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:38:18] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:38:19] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:38:50] (03CR) 10Tiziano Fogli: [C:03+1] Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:39:20] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[09:39:58] !log kharlan@deploy2002 kharlan: Continuing with sync [09:40:07] (03CR) 10Tiziano Fogli: [C:03+2] check_gdnsd_checkconf: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [09:40:29] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:41:40] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:20] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:43:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83713 and previous config saved to /var/cache/conftool/dbconfig/20251009-094333-root.json [09:44:11] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194890|Check against correct key in sortEntitiesByTimestamp (T406707)]] (duration: 11m 18s) [09:44:14] T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707 [09:48:49] (03PS1) 10Majavah: haproxy: Remove separate cloud::base class [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) [09:50:22] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7240/" [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [09:52:13] (03CR) 10Majavah: [V:03+1] "seems like zuul-haproxy-01.zuul.eqiad1.wikimedia.cloud uses a custom 
puppetserver which is not hooked up to PCC?" [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [09:52:14] (03CR) 10Clément Goubert: [C:03+1] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [09:52:54] (03PS5) 10Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) [09:53:56] (03CR) 10Clément Goubert: "Small change needed since we introduced fractional routing" [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [09:54:59] (03CR) 10Muehlenhoff: [C:03+1] "This looks good, the same change was already applied for the new EPP template used on Trixie and later (see comment inline)." [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [09:55:46] (03PS3) 10Hnowlan: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French) [09:55:46] (03CR) 10Hnowlan: [C:03+1] "This is so much neater than I expected it to be, thank you for doing it!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French) [09:58:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83715 and previous config saved to /var/cache/conftool/dbconfig/20251009-095839-root.json [09:59:51] (03CR) 10Muehlenhoff: [C:03+1] "Just checked on Bullseye: even if someone uses SSH 8.4 (the version shipped in Bullseye), the NIST variants aren't in the default value of" [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [09:59:56] (03CR) 10Clément Goubert: [C:03+1] api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000) [10:01:29] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:01:35] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:02:12] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:04:08] (03CR) 10Clément Goubert: [C:03+2] Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:04:27] (03CR) 10Clément Goubert: [C:03+1] K8s reverse DNS delegation: remove wikikube-ctrl1001 and add new nets [dns] - 10https://gerrit.wikimedia.org/r/1194678 (https://phabricator.wikimedia.org/T383227) (owner: 10Cathal Mooney) [10:04:45] (03CR) 10Elukey: 
[C:03+1] Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:05:33] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [10:05:35] (03CR) 10Elukey: [C:03+1] Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:05:58] (03Merged) 10jenkins-bot: Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:08:06] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:08:40] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:08:49] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:09:00] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice cleanup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [10:09:01] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:09:20] (03CR) 10Majavah: [V:03+1 C:03+2] haproxy: Remove separate cloud::base class [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [10:09:32] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:10:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:11:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:11:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:11:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:12:01] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:15:21] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:17:49] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:17:55] 10SRE-SLO, 10EditCheck, 06Editing-team (Kanban Board), 07Essential-Work, 05Goal: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11258956 (10elukey) 05Open→03Resolved Closed the task in favor of T406836, since the work is done :) [10:20:08] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:20:19] (03CR) 10Elukey: [V:03+1 C:03+2] profile::amd_gpu: add initial support for the k8s node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [10:20:36] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:24:12] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:26:26] (03CR) 10Elukey: [C:03+2] admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [10:26:44] jouncebot: nowandnext [10:26:44] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000) [10:26:44] In 1 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [10:26:44] In 1 hour(s) and 33 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [10:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:27:11] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) [10:28:42] elukey@cumin1003 
provision (PID 1961053) is awaiting input [10:29:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:29:07] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:30:20] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:34:20] (03CR) 10Slyngshede: [C:03+2] P:idp prepare for new Trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [10:37:51] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:38:26] elukey: your patch leads to wide-spread Puppet failures on Hadoop nodes: [10:38:34] Function lookup() did not find a value for the name 'profile::kubernetes::cluster_name' [10:38:42] in /srv/puppet_code/environments/production/modules/profile/manifests/amd_gpu.pp, line: 4 [10:42:37] (03PS1) 10Muehlenhoff: Fix Hiera lookups for the node labeller on non ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1194909 (https://phabricator.wikimedia.org/T373806) [10:42:46] ^ should fix it [10:42:53] (03CR) 10Fabfur: haproxy: try to parse also non utf8 characters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur) [10:43:43] (03PS3) 10Fabfur: haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) [10:46:26] (03CR) 10Cathal Mooney: [C:03+2] K8s reverse DNS delegation: remove wikikube-ctrl1001 and add new nets [dns] - 10https://gerrit.wikimedia.org/r/1194678 (https://phabricator.wikimedia.org/T383227) (owner: 10Cathal Mooney) [10:46:50] !log cmooney@dns2005 START - running authdns-update [10:47:09] moritzm: ah snap! Lemme check [10:47:45] !log cmooney@dns2005 END - running authdns-update [10:47:54] elukey: AFAICT https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194909 should fix it [10:49:39] moritzm: the problem IIUC is that String $kubernetes_cluster_name = lookup('profile::kubernetes::cluster_name'), is always executed, and it doesn't have a default value (my bad of course) [10:49:56] adding a default to undef should fix the problem, wdyt? 
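(Editor's note: a minimal sketch of the fix being discussed, not the merged patch. Only the parameter name `$kubernetes_cluster_name`, the Hiera key `profile::kubernetes::cluster_name`, and the `$enable_node_labeller` conditional come from the log. The Hiera key used for `$enable_node_labeller`, the class body, and the error message are assumptions made for illustration. The idea is that `lookup()` without a default raises an error on any host where the key is missing, so giving it an `undef` default and loosening the type to `Optional[String]` lets non-Kubernetes roles compile.)

```puppet
# Hypothetical sketch: the real profile::amd_gpu class may look different.
class profile::amd_gpu (
    # Assumed Hiera key, not taken from the log.
    Boolean          $enable_node_labeller    = lookup('profile::amd_gpu::enable_node_labeller', {'default_value' => false}),
    # Before: String ... = lookup('profile::kubernetes::cluster_name')
    # which fails catalog compilation wherever the key is undefined.
    Optional[String] $kubernetes_cluster_name = lookup('profile::kubernetes::cluster_name', {'default_value' => undef}),
) {
    if $enable_node_labeller {
        # The cluster name is only needed inside this conditional,
        # so a missing value is only an error here.
        if $kubernetes_cluster_name == undef {
            fail('profile::kubernetes::cluster_name must be set when the node labeller is enabled')
        }
        notice("node labeller enabled for cluster ${kubernetes_cluster_name}")
    }
}
```

(With this shape, hosts that never enable the node labeller no longer depend on the Kubernetes Hiera key at all.)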
[10:50:49] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [10:51:44] (03CR) 10Vgutierrez: [C:03+1] haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur) [10:51:45] ah, indeed. given that $kubernetes_cluster_name isn't used outside of the $enable_node_labeller conditional that should be fine [10:52:49] (03PS1) 10Elukey: profile::amd_gpu: relax hiera lookup for the kubernetes cluster metadata [puppet] - 10https://gerrit.wikimedia.org/r/1194910 [10:52:52] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [10:52:54] moritzm: --^ [10:53:08] totally my bad, I didn't run pcc on hadoop [10:53:14] I always forget we have gpus in there too [10:54:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194910 (owner: 10Elukey) [10:54:44] thanks for the ping and sorry for the noise [10:54:45] (03Abandoned) 10Muehlenhoff: Fix Hiera lookups for the node labeller on non ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1194909 (https://phabricator.wikimedia.org/T373806) (owner: 10Muehlenhoff) [10:55:19] np, I was glad it wasn't a terrible issue with the Puppet servers instead :-) [10:55:34] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [10:56:26] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: relax hiera lookup for the kubernetes cluster metadata [puppet] - 10https://gerrit.wikimedia.org/r/1194910 (owner: 10Elukey) [10:57:14] (03Abandoned) 10Btullis: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 
10https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908) (owner: 10Nicolas Fraison) [10:57:48] !log installing qemu security updates [10:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:45] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:03:03] (03CR) 10Muehlenhoff: [C:03+2] Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:08:20] (03CR) 10Muehlenhoff: [C:03+2] Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:10:01] (03CR) 10Ladsgroup: [C:03+1] es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:10:09] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:12:35] (03PS2) 10Ladsgroup: maintain-views: Add abusefilterblockeddomainhit to allowed log types [puppet] - 10https://gerrit.wikimedia.org/r/1194294 (https://phabricator.wikimedia.org/T406562) [11:12:40] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Add abusefilterblockeddomainhit to allowed log types [puppet] - 10https://gerrit.wikimedia.org/r/1194294 (https://phabricator.wikimedia.org/T406562) (owner: 10Ladsgroup) [11:13:17] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp1005.wikimedia.org [11:13:19] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [11:14:14] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [11:16:49] !log slyngshede@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1005.wikimedia.org - slyngshede@cumin1003" [11:18:14] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1005.wikimedia.org - slyngshede@cumin1003" [11:18:14] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:18:15] !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp1005.wikimedia.org on all recursors [11:18:18] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1005.wikimedia.org on all recursors [11:18:44] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1005.wikimedia.org - slyngshede@cumin1003" [11:18:48] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1005.wikimedia.org - slyngshede@cumin1003" [11:20:22] (03PS4) 10Revi: kowikisource: Add "해석" namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405) [11:20:54] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp1005.wikimedia.org with OS trixie [11:21:09] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:21:40] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [11:23:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:27:56] !log ladsgroup@cumin1003 END (PASS) - Cookbook 
sre.wikireplicas.update-views (exit_code=0) [11:28:29] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) [11:30:44] (03PS1) 10Muehlenhoff: installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) [11:32:43] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1005.wikimedia.org with reason: host reimage [11:32:53] (03CR) 10CI reject: [V:04-1] installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:35:31] (03PS2) 10Muehlenhoff: installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) [11:35:56] jouncebot: nowandnext [11:35:56] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [11:35:56] In 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [11:35:56] In 0 hour(s) and 24 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [11:36:24] okay, I wait until gerrit maint is over [11:36:53] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1005.wikimedia.org with reason: host reimage [11:38:09] (03PS1) 10Revi: kowiki: Restrict move ratelimit for non-extendedconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849) [11:38:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:39:17] jouncebot: nowandnext [11:39:17] No deployments scheduled for the 
next 0 hour(s) and 20 minute(s) [11:39:17] In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [11:39:17] In 0 hour(s) and 20 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [11:40:04] !log jynus@cumin1002 START - Cookbook sre.hosts.reboot-single for host dbprov2007.codfw.wmnet [11:40:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849) (owner: 10Revi) [11:41:43] (03PS2) 10Arnaudb: Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833) [11:41:43] (03PS2) 10Arnaudb: Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833) [11:43:28] (03PS3) 10Muehlenhoff: installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) [11:45:03] 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259418 (10k... [11:45:32] 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259425 (10k... 
[11:45:45] 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259426 (10k... [11:46:15] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dbprov2007.codfw.wmnet [11:47:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:51:44] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:53:09] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1005.wikimedia.org with OS trixie [11:53:09] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1005.wikimedia.org [11:53:16] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11259462 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:53:20] !log disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194676 on cp5021 (T404427) [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:29] (03CR) 10Muehlenhoff: "(PCC failure is expected, the install servers use Puppet7 syntax)" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:53:40] (03CR) 10Btullis: [V:03+1 C:03+2] Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 
10Btullis) [11:54:34] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [11:55:22] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259467 (10Jclark-ctr) [11:55:29] (03CR) 10Fabfur: [C:03+2] haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur) [11:56:05] (03CR) 10Jelto: [C:03+1] "lgtm, we should keep in mind to enable the backups I3dc22ca2c6eb89aade24601cf3699c51faf47fd6" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [11:56:36] (03PS1) 10Muehlenhoff: atftpd: Drop service definition [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) [11:57:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259470 (10Jclark-ctr) [11:57:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:57:54] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:59:10] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11259473 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:59:33] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11259477 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF [11:59:42] !log installing luajit security updates [11:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Deploy window 
Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [12:00:05] arnaudb and hashar: Deploy window Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [12:00:16] yep [12:01:48] (03CR) 10Arnaudb: [C:03+2] Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:01:57] (03CR) 10Arnaudb: [C:03+2] Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:02:05] !log arnaudb@dns1004 START - running authdns-update [12:03:37] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:03:41] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:03:49] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [12:04:31] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org [12:04:35] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org [12:05:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259508 (10Jclark-ctr) Replacement drive has arrived [12:06:45] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org [12:07:13] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit2003.wikimedia.org [12:07:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - 
https://phabricator.wikimedia.org/T403852#11259511 (10MoritzMuehlenhoff) [12:10:07] !log enable puppet on A:cp-eqsin to deploy https://gerrit.wikimedia.org/r/1194676 (T404427) [12:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:25] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:42] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:12:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:20] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:20] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:22] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:13:26] !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp2005.wikimedia.org [12:13:27] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [12:14:12] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:16] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:14:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:14:58] ^ I silenced this for the gerrit switchover for one hour [12:15:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:22] PROBLEM - check if authdns-update 
was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:16:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:17:18] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2005.wikimedia.org - slyngshede@cumin1003" [12:17:49] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2005.wikimedia.org - slyngshede@cumin1003" [12:17:50] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [12:17:50] !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp2005.wikimedia.org on all recursors [12:17:53] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2005.wikimedia.org on all recursors [12:18:24] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2005.wikimedia.org - slyngshede@cumin1003" [12:18:28] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2005.wikimedia.org - slyngshede@cumin1003" [12:18:44] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:18:48] !log reloading haproxy on A:cp-eqsin (T404427) [12:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:54] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp2005.wikimedia.org with OS trixie [12:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:21:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [12:21:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:14] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and 
PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:31:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259564 (10BTullis) I can see from `dmesg -T` that the drive in question is `/dev/sde` It was remounted read-only on Oct 3rd. ` [Fri Oct 3 02:12:46 2025]... [12:36:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [12:37:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11259596 (10BTullis) OK, this is a software RAID10 volume. It looks like it is drive `/dev/sde` that has failed. ` btullis@druid1011:~$ cat... 
[12:39:26] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2005.wikimedia.org with reason: host reimage [12:44:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [12:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:46:02] PROBLEM - SSH on cumin1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:47:44] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:47:58] RECOVERY - SSH on cumin1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:49:15] FIRING: [2x] ProbeDown: Service idp2005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:36] jouncebot: nowandnext [12:49:36] For the next 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [12:49:36] For the next 0 hour(s) and 10 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200) [12:49:36] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1300) [12:50:31] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2005.wikimedia.org with reason: host reimage [12:51:25] FIRING: [9x] 
SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:16] (03CR) 10Arnaudb: [V:03+2 C:03+2] "slow rsync for local backup issue" [puppet] - 10https://gerrit.wikimedia.org/r/1194926 (owner: 10Arnaudb) [12:53:27] (03CR) 10Arnaudb: [V:03+2 C:03+2] "slow rsync for local backup issue" [dns] - 10https://gerrit.wikimedia.org/r/1194927 (owner: 10Arnaudb) [12:53:31] !log arnaudb@dns1004 START - running authdns-update [12:53:40] !log arnaudb@dns1004 START - running authdns-update [12:53:53] !log arnaudb@dns1004 START - running authdns-update [12:54:12] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:55:03] !log arnaudb@dns1004 END - running authdns-update [12:55:16] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:55:53] (03CR) 10Filippo Giunchedi: [C:03+2] Cloudcephosd1050: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott) [12:56:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:10] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1235 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:56:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:36] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:36] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:40] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:56:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:57:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:58:12] !log enable puppet on A:cp to deploy https://gerrit.wikimedia.org/r/1194676 (T404427) [12:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:58:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:58:22] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:58:23] (03CR) 10Federico Ceratto: "Yes please. Also I'm planning to add safety checks based on your comment above." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [12:59:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259678 (10Jclark-ctr) @BTullis Failed drive has been replaced! Thanks for the assistance [12:59:16] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [12:59:31] Nothing seems to be running in https://integration.wikimedia.org/zuul/. Is that currently expected? [12:59:59] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1300) [13:00:05] edsanders, Daimona, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] Dreamy_Jazz: sounds relatively expected to me, from what I could tell Gerrit only became available again a few minutes ago [13:00:22] I don’t know if the maintenance is considered over or not [13:00:32] Yeah, I was asking for the window [13:00:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye [13:00:39] Given that there are patches to go through zuul [13:01:00] o/ [13:01:07] * Lucas_WMDE sees Dreamy_Jazz already resubmitted https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/1192178 and closes the tab [13:01:22] Well it isn't going currently :D [13:01:30] I see [13:01:48] anyway, I have a meeting in 15 minutes so I’m not the best person to run today’s backport window anyway… anyone else? ^^ [13:02:16] We probably can't run it until zuul is working again [13:02:33] Though my one doesn't need to go through gerrit, so I could make a start on that [13:03:04] (03CR) 10Federico Ceratto: [C:03+2] es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [13:03:06] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [13:03:20] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:03:55] Can everyone in this window self deploy? May not need someone to handle it given that? 
[13:04:49] I'm going to start on my one given it doesn't need zuul to be working [13:04:51] I cannot [13:04:55] Dreamy_Jazz: go ahead [13:05:12] I should be around for the entire window if we are able to do the patches that need zuul [13:05:17] o/ [13:05:30] arnaudb, hashar: is the Gerrit maintenance still ongoing? (wondering because of the backport+config window) [13:05:42] I can self deploy - let me know when it's my turn [13:06:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2053.codfw.wmnet with reason: Setting up new ES host [13:08:26] https://phabricator.wikimedia.org/T406762#11259697 sounds like there are problems with the Gerrit maintenance 😬 [13:08:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [13:08:51] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2005.wikimedia.org with OS trixie [13:08:51] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2005.wikimedia.org [13:11:07] Running scap now for the private code change [13:12:44] okay, slack says the gerrit switchover is reverted once more (due to the issue mentioned above) [13:12:51] (I assume it’ll be sent to wikitech-l in a moment) [13:14:16] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:14:24] looks like Zuul is still asleep [13:14:54] (someone™ could in theory run Daimona’s maintenance script but I assume it would be better to do that only *after* deploying the config change) [13:15:07] Yeah, I think so [13:15:14] * Lucas_WMDE in meeting [13:15:42] Dreamy_Jazz: do I need to wait to start my config deployment? [13:15:57] Yeah. 
Zuul isn't running, so it won't be able to be merged [13:15:58] Shouldn't it be done before, since the config change removes the group? Or is it still possible to empty a group that no longer exists? [13:16:13] Additionally I am currently running scap at the moment [13:16:55] Re whether to run the script, maybe it should be run before but then the config change merged immediately afterwards? [13:17:13] To minimise the time that some proportion of users lack access to the tool [13:17:19] I will take care of zuul [13:17:21] yeah, we should at least know that we’ll be able to run the config change soon ^^ [13:17:25] thx hashar [13:17:37] Thanks [13:18:08] Zuul used to be able to reconnect to Gerrit [13:18:13] 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863 (10FCeratto-WMF) 03NEW [13:18:15] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:18:26] (03PS1) 10Elukey: sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 [13:19:20] Queue lengths at 0 [13:19:27] there are 8 connections on gerrit originating from jenkins-bot [13:19:51] (03PS1) 10Arnaudb: Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1194931 [13:20:04] (03PS1) 10Arnaudb: Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1194932 [13:20:12] (03PS1) 10Filippo Giunchedi: cloudceph: fix single-nic vlan interface specification [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) [13:20:14] Yeah that makes sense [13:20:49] Or if we're still waiting, I can split the config change [13:20:55] !log Closed jenkins-bot connections on Gerrit primary [13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] yeah it reconnected [13:21:16] I'm done with my deploy, so the next config change should be ready to go once zuul is back [13:21:32] !log Zuul successfully reconnected to Gerrit [13:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] esanders: Do you have deployment rights and want to self deploy this? [13:22:00] edsanders: [13:22:09] yeah - I can [13:22:16] shall I start? 
[13:22:19] Over to you then, yes [13:22:38] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:22:39] I can hang around for Daimona's changes if you just want to handle yours [13:22:56] I'm going to split my patch [13:23:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders) [13:24:08] (03Merged) 10jenkins-bot: Revert "Invalidate Flow cache on enwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders) [13:24:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [13:24:26] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]] [13:24:49] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "I'm live-iterating (!) on this atm and self-merging for expediency" [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:24:51] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:24:51] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] cloudceph: fix single-nic vlan interface specification [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [13:25:26] (03PS2) 10Daimona Eaytoy: Assign CampaignEvents user rights to autoconfirmed in small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) [13:25:26] (03PS1) 10Daimona Eaytoy: Delete the event-organizer user group on medium and small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445) [13:25:27] (03PS1) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [13:26:09] (03CR) 10Muehlenhoff: "Looks good. I'll test this with sretest and will report back" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey) [13:26:32] Done, and calendar updated [13:28:24] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130) [13:28:30] !log esanders@deploy2002 esanders: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:40] my meeting is over, so I could also take over deployment if needed [13:28:47] Thanks. 
I guess the order is grant autoconfirmed access, run the script, then undefine the group [13:28:49] (didn’t take as long as expected ^^) [13:28:52] !log esanders@deploy2002 esanders: Continuing with sync [13:29:09] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130) (owner: 10Jforrester) [13:29:14] Lucas_WMDE: If you could that would be great, as I want to go eat some lunch [13:29:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [13:30:11] sure! [13:30:28] RECOVERY - dump of s1 in codfw on backupmon1001 is OK: Last dump for s1 at codfw (db2141) taken on 2025-10-09 11:44:02 (180 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:30:37] Thanks. See all of yous later \o [13:30:51] Thank you! (Also in a meeting BTW) [13:31:16] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130) (owner: 10Jforrester) [13:31:38] (03Abandoned) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (owner: 10Dzahn) [13:32:00] (03Restored) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (owner: 10Dzahn) [13:32:06] (03PS2) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (https://phabricator.wikimedia.org/T406774) [13:32:24] (03Abandoned) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn) [13:32:50] !log jforrester@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [13:32:55] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]] (duration: 08m 29s) [13:34:11] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:34:29] alright, I’ll take over [13:35:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [13:36:07] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [13:36:14] (03Merged) 10jenkins-bot: Assign CampaignEvents user rights to autoconfirmed in small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [13:36:33] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]] [13:36:36] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [13:36:53] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [13:37:11] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [13:37:59] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [13:41:06] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:41:32] Daimona: please test the first half :) [13:41:46] (or the first third? 
because the maintenance script is also a step? whatever ^^) [13:41:59] (03PS1) 10Muehlenhoff: hcaptcha_proxy: Select the custom nginx provider instead of extras [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) [13:43:53] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:43:56] LGTM! [13:44:10] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [13:44:15] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Continuing with sync [13:44:16] ok! [13:44:19] (me too) [13:44:34] (03PS1) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) [13:45:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye [13:45:30] (03CR) 10Ssingh: [C:03+1] "Thanks, sounds good. https://wiki.debian.org/Nginx is not updated, so what's another good way of finding out what goes in all the differen" [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [13:45:52] (03PS2) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [13:48:15] Daimona: for the maintenance script, what exactly should I run? 
[13:48:22] my feeling would be that we want it *with* --create-log [13:48:24] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]] (duration: 11m 51s) [13:48:28] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [13:48:34] and --log-reason='[[phabricator:T401445|T401445]]' or something like that [13:48:39] I wasn't sure about the exact options because I've never done this before [13:48:48] I’m just looking at the help output on my localhost ^^ [13:48:48] (03CR) 10Muehlenhoff: "The underlying idea is that everyone should only use the generic nginx binary and then install libnginx-mod-foo packages as they need them" [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [13:48:51] I'm also not sure how many users will be affected [13:48:52] * Lucas_WMDE searches SAL [13:49:03] Maybe not too many, which is why we're changing this in the first place [13:49:22] yeah that would’ve been my guess [13:49:34] https://sal.toolforge.org/production?p=0&q=emptyUserGroup*&d= suggests the create log option might be pretty new [13:50:01] (03CR) 10Muehlenhoff: [C:03+2] hcaptcha_proxy: Select the custom nginx provider instead of extras [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff) [13:50:03] Dreamy_Jazz: if you’re still here – apparently you created (or backported) that option in May, but I don’t see a SAL entry using the option; were there any problems or was it just not !log’ed? 
[13:50:12] * Lucas_WMDE looks at the task [13:50:40] ok https://phabricator.wikimedia.org/T393360#10867211 sounds like it worked fine [13:51:33] (03CR) 10Vgutierrez: "pretty cool job <3" [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis) [13:51:34] Yeah, something pointing to the phab task would work I think [13:51:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11260003 (10Jclark-ctr) I will work on this tomorrow since i am using an older drive i want to make sure it is wiped prior to installing. [13:51:54] (03CR) 10Cathal Mooney: "Patch seems sane. Not terribly familiar with how this works but if it's already ok on the bookworm install servers makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [13:52:00] (03CR) 10Cathal Mooney: [C:03+1] installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [13:52:25] hmm [13:52:43] does mwscript-k8s --dblist support expressions [13:52:45] --help says it does [13:53:21] so, I would run: [13:53:22] mwscript-k8s --comment=T401445 --sal --dblist=small+medium -- emptyUserGroup --create-log --log-reason='[[phabricator:T401445|T401445]]' event-organizer [13:53:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:53:46] Seems correct [13:53:52] running [13:54:00] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist small+medium emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer # T401445 [13:54:01] Wish it had a dry-run but nvm [13:54:03] T401445: Update Event Registration (organizer side) to be 
available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [13:54:17] oh, oops, I forgot to --follow [13:54:17] meh [13:54:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [13:54:54] Don't forget to --follow and smash that like button (?) [13:55:05] :trout: [13:55:09] it failed :( [13:55:14] can I figure out why though [13:55:16] (03Merged) 10jenkins-bot: Delete the event-organizer user group on medium and small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [13:55:27] the output from `K8S_CLUSTER=codfw KUBECONFIG=/etc/kubernetes/mw-script-codfw.config kubectl logs -f job/mw-script.codfw.tbw487ih mediawiki-tbw487ih-app` is useless [13:55:32] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]] [13:55:34] “emptyUserGroup: Running on small+medium” and that’s it [13:55:43] That's not very informative [13:55:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye [13:55:50] I… guess I should abort that scap once it’s hit mwdebug [13:56:03] lemme just try running it on small and medium separately [13:56:13] Yeah thought the same [13:56:28] I don’t actually know if + is a valid dblist operator, I just assumed [13:56:32] but I don’t use dblist maths often ^^ [13:56:34] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist small emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer # T401445 [13:56:41] yeah that looks much better [13:56:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns 
(exit_code=0) rolling restart_daemons on A:wikidough [13:56:43] guess that was it then [13:56:56] a lot of “group was empty” so far [13:56:58] 4 users in one wiki [13:57:02] I’ll export the output later [13:57:07] (could’ve tee’d it to a file, meh) [13:57:40] (03CR) 10Ssingh: [C:03+2] site.pp: reimage all hcaptcha nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [13:58:10] Yeah I don't think we're expecting many changes [13:59:02] (03PS1) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) [13:59:02] (03CR) 10Arnaudb: "simple step to speed up next switchover" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:59:35] (03PS2) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) [13:59:40] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1050.eqiad.wmnet [14:00:02] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:00:05] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [14:00:18] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist medium emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer # T401445 [14:01:01] let’s let the medium run finish before syncing [14:01:13] but Daimona can you test the second change in the meantime? 
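
Note on the exchange above: the combined `--dblist=small+medium` run failed with no useful output, while running against `small` and then `medium` separately worked. A minimal sketch of that per-list approach follows. The flags are the ones quoted in the log. The loop, the `print_commands` helper name, and the placement of `--follow` before `--` are assumptions made for illustration. The script only prints the commands for review; it is not an existing dry-run mode of `mwscript-k8s`.

```shell
#!/bin/sh
# Sketch only: print one mwscript-k8s invocation per dblist, rather than
# combining the lists as "small+medium" (the combined run failed above).
# Flags are copied from the command quoted in the log; print_commands is a
# hypothetical helper and nothing here is executed against production.
TASK=T401445

print_commands() {
    for dblist in small medium; do
        echo "mwscript-k8s --comment=$TASK --sal --follow --dblist=$dblist --" \
             "emptyUserGroup --create-log" \
             "--log-reason='[[phabricator:$TASK|$TASK]]' event-organizer"
    done
}

print_commands
```

Printing first gives a second person something to review before anything runs, which partly covers the missing dry-run mentioned in the conversation. Each printed line would then be run by hand on the deployment host, one list at a time.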
[14:01:27] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1001.wikimedia.org with OS bookworm [14:01:40] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2001.wikimedia.org with OS bookworm [14:02:17] !log for the record, the `foreachwikiindblist small+medium emptyUserGroup` maintenance script run (for T401445) did *not* work, running the maintenance script separately for small and medium worked better [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:47] (I don’t like the idea that someone might pull the broken command out of the SAL without realizing it, hence this log ^^) [14:03:31] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1002.wikimedia.org with OS bookworm [14:03:35] Well but they'd need to read that entry before copypasting the command :D [14:03:42] Anyway, testing the second part now [14:03:44] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2002.wikimedia.org with OS bookworm [14:03:50] yeah, that’s why I tried to include some of the same terms someone might’ve been searching for :P [14:03:52] thanks [14:04:59] User group removal looks good too [14:05:33] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:05:33] ok! [14:05:39] (03CR) 10CDanis: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis) [14:05:44] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1050.eqiad.wmnet [14:08:17] !log restart pybal on lvs1020 to pick up WDQS changes [14:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] jouncebot: nowandnext [14:09:32] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [14:09:32] In 0 hour(s) and 20 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430) [14:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:10:19] elukey@cumin1003 provision (PID 1990297) is awaiting input [14:10:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]] (duration: 14m 47s) [14:10:23] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [14:12:51] !log UTC afternoon backport+config window done [14:12:52] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:15] Thanks Lucas! [14:13:21] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:13:23] np :) [14:13:49] (03CR) 10Ladsgroup: Avoid using wikitech dblist in configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [14:15:14] Wait [14:15:22] Did I just screw up and the group didn't need to be empties [14:15:23] d [14:16:52] (03CR) 10Ladsgroup: Avoid using wikitech dblist in configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [14:17:56] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage [14:18:28] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage [14:18:54] RECOVERY - dump of s3 in codfw on backupmon1001 is OK: Last dump for s3 at codfw (db2239) taken on 2025-10-09 11:44:02 (127 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:19:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage [14:20:05] (03CR) 10JHathaway: [C:03+2] sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [14:20:29] (03CR) 10Hashar: [C:04-1] "-1 to ack the remarks mentioned by Tacsipacsi. 
I'll take them in account and amend, but not today :]" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [14:20:30] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [14:21:41] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage [14:21:51] (03PS4) 10Scott French: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) [14:23:04] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage [14:23:25] (03CR) 10Scott French: "Thank you both for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [14:23:34] (03PS4) 10Scott French: rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955) [14:24:54] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:22] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:26:56] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-main_443: Servers wdqs1026.eqiad.wmnet are marked down but pooled: wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:28:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage [14:28:42] (03PS1) 10Bking: wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) [14:29:12] !log rest.php group2-except-enwiki on rest-gateway at 10% [14:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] (03PS2) 10CDanis: wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking) [14:29:54] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430) [14:31:31] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage [14:34:24] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with 
Dell SCP reboot policy GRACEFUL [14:35:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:35:31] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1001.wikimedia.org with OS bookworm [14:36:28] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet'] [14:36:45] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2046.codfw.wmnet'] [14:36:48] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet'] [14:37:01] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2046.codfw.wmnet'] [14:37:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2001.wikimedia.org with OS bookworm [14:39:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11260203 (10Jclark-ctr) 05Open→03Resolved [14:39:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS bullseye [14:39:57] (03PS1) 10SBassett: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) [14:42:26] RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [14:42:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:42:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling 
restart_daemons on A:dnsbox [14:43:00] (03PS1) 10D3r1ck01: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) [14:43:18] (03CR) 10JHathaway: [C:03+1] installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [14:43:19] (03PS1) 10D3r1ck01: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) [14:44:02] (03CR) 10Ssingh: [C:03+1] wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking) [14:44:05] (03CR) 10Ssingh: [C:03+2] wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking) [14:44:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1002.wikimedia.org with OS bookworm [14:44:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [14:44:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [14:46:10] (03PS2) 10SBassett: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 
(https://phabricator.wikimedia.org/T406501) [14:47:20] !log restart pybal on lvs1020 [14:47:20] (03PS3) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) [14:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:35] (03CR) 10SBassett: OATHAuth Recovery Code code improvement (031 comment) [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett) [14:47:44] (03CR) 10CI reject: [V:04-1] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [14:47:56] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:34] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2002.wikimedia.org with OS bookworm [14:49:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett) [14:49:31] (03CR) 10JHathaway: [C:03+1] sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey) [14:49:39] (03PS1) 10Filippo Giunchedi: cloudceph: handle double -> single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) [14:49:39] (03PS7) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 [14:50:42] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes 
for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) [14:53:37] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 [14:53:43] jouncebot: nowandnext [14:53:43] For the next 0 hour(s) and 6 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430) [14:53:43] In 0 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1500) [14:55:18] (03CR) 10Majavah: [C:04-1] cloudceph: handle double -> single NIC transition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [14:58:08] RECOVERY - dump of s4 in codfw on backupmon1001 is OK: Last dump for s4 at codfw (db2239) taken on 2025-10-09 13:20:35 (238 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:00:02] (03CR) 10Muehlenhoff: [C:03+1] "I've confirmed with a reimage that it fixes passing a different installer environment." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey) [15:00:05] jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1500) [15:00:21] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [15:01:16] (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) [15:02:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11260369 (10Jhancock.wm) @MatthewVernon ms-be2083 and ms-be2084 controllers have been swapped out. [15:05:12] (03CR) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (owner: 10Elukey) [15:07:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:35] (03CR) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (owner: 10Elukey) [15:15:20] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 100% 
[puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [15:16:29] (03PS1) 10KartikMistry: Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854) [15:18:37] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey) [15:21:21] I'll deploy recommendation API. Minor change. [15:21:37] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry) [15:22:52] (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194609 (https://phabricator.wikimedia.org/T406318) [15:23:41] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry) [15:23:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:25:16] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[15:26:47] !log sukhe@lvs1019:~$ sudo systemctl restart pybal.service [15:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:56] (03PS1) 10Mszwarc: arbcom_plwiki: Change favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883) [15:31:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883) (owner: 10Mszwarc) [15:31:47] (03CR) 10Vgutierrez: WIP: ja4h lua first draft, & concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis) [15:32:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:34:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:50] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [15:37:04] (03PS3) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [15:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:01] (03CR) 10Ssingh: [C:03+2] conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [15:42:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Create boot environment of Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11260770 (10MoritzMuehlenhoff) I built a bullseye d-i environment with the Linux 6.1 from Debian LTS (https://tracker.debian.org/pkg/linux-6.1) and after some needed fi... [15:45:56] (03PS1) 10Reedy: DNM (yet): Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 [15:46:52] (03PS1) 10Federico Ceratto: site.pp: Add es2052 [puppet] - 10https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) [15:48:27] Hey folks! Earlier today, I asked a user group to be emptied on several wikis which I later realized I shouldn't have done. I would like to revert these changes and re-add the affected users to the group (list in https://phabricator.wikimedia.org/P83722). Could I get another pair of eyes on this and run the script shortly? 
[15:48:38] !log sukhe@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=proxoid,name=hcatpcha.* [reason: setting weight for proxoid hcaptcha dedicated VM] [15:48:38] ("the script" = createAndPromote) [15:48:53] (03PS4) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [15:48:59] !log sukhe@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=proxoid,name=hcaptcha.* [reason: setting weight for proxoid hcaptcha dedicated VM] [15:51:36] (03PS1) 10Daimona Eaytoy: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 [15:51:44] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:52:02] (03PS2) 10Daimona Eaytoy: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) [15:52:03] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863#11260824 (10VRiley-WMF) a:03VRiley-WMF [15:52:32] (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194609 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert) [15:52:44] (03PS1) 10Ssingh: conftool-data: proxoid: remove urldownloader machines [puppet] - 10https://gerrit.wikimedia.org/r/1194982 (https://phabricator.wikimedia.org/T405631) [15:55:36] (03PS2) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 [15:56:06] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with 
chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:57:36] (03CR) 10Ssingh: [C:03+2] conftool-data: proxoid: remove urldownloader machines [puppet] - 10https://gerrit.wikimedia.org/r/1194982 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [15:57:47] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:58:50] (03PS3) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 [15:59:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:59:57] (03PS2) 10Reedy: DNM (yet): Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 [16:00:05] jhathaway and moritzm: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:17] (03PS1) 10Ssingh: Revert "conftool-data: proxoid: remove urldownloader machines" [puppet] - 10https://gerrit.wikimedia.org/r/1194984 [16:00:29] (03CR) 10Ssingh: "emergency revert if required, do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/1194984 (owner: 10Ssingh) [16:02:07] jhathaway: do you mind if I deploy a MediaWiki backport during the puppet window? 
[16:02:08] PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:45] maybe two if Daimona still needs the config change to be reverted [16:03:12] I have some more coming so I might add them to the later window [16:03:17] tgr_: I don't think there's anything planned for the puppet window [16:03:51] ok, thanks, I'll go along then [16:04:38] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 (owner: 10Elukey) [16:05:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [16:05:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [16:06:29] (03CR) 10Ssingh: [C:04-2] Revert "conftool-data: proxoid: remove urldownloader machines" [puppet] - 10https://gerrit.wikimedia.org/r/1194984 (owner: 10Ssingh) [16:06:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11260904 (10BTullis) 05Resolved→03Open a:05Jclark-ctr→03BTullis Reopening and assigning to myself, because there is a manual op to do here. I hope... 
[16:07:53] (03PS1) 10Daimona Eaytoy: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) [16:08:46] (03CR) 10CI reject: [V:04-1] Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [16:09:57] (03Merged) 10jenkins-bot: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [16:10:04] (03Merged) 10jenkins-bot: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [16:10:27] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]] [16:10:32] (03PS1) 10Cathal Mooney: inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) [16:10:36] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [16:10:36] T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633 [16:10:37] T405634: Authenticated data should not be in the anonymous store - https://phabricator.wikimedia.org/T405634 [16:11:27] (03CR) 10Muehlenhoff: "(PCC failure for P5 is expected)" [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [16:13:31] (03PS1) 10Vgutierrez: haproxy: Set and 
propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) [16:14:09] !log tgr@deploy2002 tgr, d3r1ck01: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:14:16] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:14:54] huh [16:14:54] hmm [16:15:05] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990 [16:15:24] ah from the earlier WDQS one [16:15:24] ok [16:15:38] fixing [16:17:20] (03PS2) 10Vgutierrez: haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) [16:17:25] (03CR) 10CDanis: [C:03+1] haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez) [16:17:49] (03CR) 10CDanis: [C:03+1] haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez) [16:18:14] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:18:16] elukey@cumin1003 provision (PID 2011513) is awaiting input [16:18:22] !log sukhe@lvs2013:~$ sudo systemctl restart pybal.service [16:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:18:46] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:19:15] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11260967 (10Jhancock.wm) Power(W) = (Voltage(V) * Amperage(A) * sqrt{3} * PowerFactor(PF)) V = 208 A = 30 sqrt{3} = 1.732 PF = .8 (80% of power to allow for fluctuations) thus 208 * 30 *... [16:21:00] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11260969 (10Jhancock.wm) [16:21:24] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:21:48] always something eh lol [16:21:51] :) [16:23:33] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990 (owner: 10BryanDavis) [16:24:10] (03PS5) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [16:24:16] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:25:46] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990 (owner: 10BryanDavis) [16:26:11] (03CR) 10CDanis: [C:03+1] inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney) [16:26:24] !log tgr@deploy2002 tgr, d3r1ck01: Continuing with sync [16:27:31] (03PS4) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) [16:27:31] (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1194994 (https://phabricator.wikimedia.org/T385066) [16:27:33] (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1194995 (https://phabricator.wikimedia.org/T385066) [16:27:35] (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1194996 (https://phabricator.wikimedia.org/T385066) [16:28:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11261019 (10Jhancock.wm) @elukey license uploaded for cp2056. should be good to try that one again. 
[16:30:34] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]] (duration: 20m 07s) [16:30:41] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [16:30:41] T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633 [16:30:42] T405634: Authenticated data should not be in the anonymous store - https://phabricator.wikimedia.org/T405634 [16:30:46] done, thanks [16:33:10] !log upgrade grafana-loki on grafana hosts T406478 [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:33] T406478: Scap logs on Grafana dashboards are broken - https://phabricator.wikimedia.org/T406478 [16:33:36] (03PS2) 10Daimona Eaytoy: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) [16:35:09] (03PS3) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 [16:35:17] jouncebot: nowandnext [16:35:17] For the next 0 hour(s) and 24 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1600) [16:35:17] In 0 hour(s) and 24 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700) [16:35:17] In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700) [16:36:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy) [16:38:45] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:40:10] (03CR) 10Ottomata: "yes! TY!" [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez) [16:44:33] cmooney@cumin1003 netbox (PID 2014825) is awaiting input [16:44:54] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:47:25] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for inter.link transit IPs in drmrs - cmooney@cumin1003" [16:50:29] cmooney@cumin1003 netbox (PID 2014825) is awaiting input [16:55:33] (03CR) 10JHathaway: [C:03+1] atftpd: Drop service definition [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [16:56:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:57:39] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for inter.link transit IPs in drmrs - cmooney@cumin1003" [16:57:39] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:05] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700). nyaa~ [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700) [17:00:39] o/ I have a developer-portal build to push out today. [17:02:38] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:04:05] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:05:15] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:06:04] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:06:30] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:07:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:32] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:22:06] (03CR) 10Jcrespo: [C:03+1] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [17:22:14] (03PS6) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [17:30:42] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1020.eqiad.wmnet with reason: downtime lvs1020 to supress alerts about enp94s0f0np0 going down and losing backend connectivity [17:30:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - 
https://phabricator.wikimedia.org/T404959#11261370 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=366dc32d-e9d7-437c-98a7-7a9cd7979655) set by cmooney@cumin... [17:31:35] !log begin work to move lvs1020 uplink cable from ssw1-f1-eqiad to ssw1-e1-eqiad [17:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:06] (03PS1) 10Ssingh: url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) [17:33:49] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7242/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [17:33:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:36:32] (03CR) 10Ssingh: [V:03+1 C:04-2] "The Phase 3 rollout of hCaptcha is on Wed Sep 15. 
I think we should not merge this until Sep 16 so that in case required, we can revert ba" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [17:38:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/33 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:39:02] ^ Cathal is working on this so expected [17:40:03] (03CR) 10Jcrespo: [C:03+2] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [17:43:25] sukhe: yeah sorry dc-ops too quick for me, I'll clear that now gimme another few [17:43:47] topranks: no worries at all on our end. take your time! [17:44:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 988180768 and 67 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:47:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 27200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:48:18] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11261445 (10ssingh) 05Open→03Resolved a:03ssingh Rolled out. 
[17:50:39] (03PS21) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:52:12] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:54:02] (03PS22) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [17:54:51] 10ops-codfw, 06DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406911 (10phaultfinder) 03NEW [17:57:48] (03PS23) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [18:01:47] rolling out some envoy upgrades in staging [18:02:26] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/apertium: apply [18:02:41] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/apertium: apply [18:02:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:13] suspiciously timed but not related [18:03:46] drms [18:03:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [18:03:56] o/ [18:04:06] 
!incidents [18:04:06] 6853 (ACKED) [2x] ProbeDown sre (text-https:443 probes/service drmrs) [18:04:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [18:04:13] nel going up [18:04:47] we had a spike in 4XX [18:06:09] (03PS1) 10Daimona Eaytoy: Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) [18:06:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [18:07:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:54] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:10:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 892324392 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:11:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 79736 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/33 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:24:54] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:30:42] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), No backups: 5 (gerrit1003, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:31:37] (03PS1) 10Jforrester: i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 [18:35:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11261706 (10cmooney) 05Open→03Resolved Link has been moved, port is up on ssw1-e1-eqiad and MACs learnt on all vlans: ` cmooney... 
[18:35:40] !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet [18:35:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet [18:36:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261714 (10RobH) So I'm going to outline a few assumptions here and steal this back. If any of the following assumptions are incorrect, please let me know. We now have a move ahead... [18:36:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261715 (10RobH) a:05BCornwall→03RobH [18:39:59] jouncebot: nowandnext [18:39:59] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [18:39:59] In 1 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000) [18:40:13] OK, I'll get an i18n change out, as it'll take too much time for a window. [18:40:29] (03CR) 10Jforrester: [C:03+2] i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester) [18:40:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261725 (10RobH) [18:43:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester) [18:44:40] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Thu 06 Nov 2025 06:10:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:52:45] (03Merged) 10jenkins-bot: i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester) [18:53:04] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]] [18:58:52] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:59:48] !log jforrester@deploy2002 jforrester: Continuing with sync [19:04:43] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]] (duration: 11m 39s) [19:12:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM! I haven't tested sending message to private channels but it must be possible as the app includes the necessary scopes to do so. 
I su" [puppet] - 10https://gerrit.wikimedia.org/r/1194736 (owner: 10Cwhite) [19:51:44] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:51:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 81103992 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:52:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 104240 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:59:02] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1018.eqiad.wmnet [19:59:16] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1019.eqiad.wmnet [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000). [20:00:05] sbassett, Reedy, and Daimona: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] o/ [20:00:26] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1020.eqiad.wmnet [20:00:32] (03CR) 10Bking: [C:03+2] dse-k8s-worker2003: return to production role [puppet] - 10https://gerrit.wikimedia.org/r/1194305 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking) [20:00:41] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time." 
[puppet] - 10https://gerrit.wikimedia.org/r/1194305 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking) [20:01:04] o/ [20:01:17] Just FYI - the order of my patches matters, and currently it’s reverse-listed :) [20:02:47] Who is deploying? :P [20:02:49] I can.. [20:02:58] (03PS1) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 [20:03:10] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis) [20:03:23] Well, Roan is def OOO. [20:03:31] Daimona: i18n changes? :P [20:03:47] Yeeeeah [20:03:49] (03CR) 10Reedy: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy) [20:03:58] (03CR) 10Reedy: [C:03+2] Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [20:03:58] You have been warned :D [20:04:08] (03CR) 10Reedy: [C:03+2] Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [20:04:29] (03CR) 10Reedy: [C:03+2] OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett) [20:04:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [20:04:42] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy) [20:04:47] (03Merged) 10jenkins-bot: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [20:04:54] 
10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), and 2 others: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11261952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host dse-k8s-worker2003.cod... [20:05:01] (03Merged) 10jenkins-bot: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [20:05:29] Shove those three out while others merge [20:06:31] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]] [20:06:34] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [20:07:16] I can take care of running createAndPromote, but I'm wondering: is there a way to delete log entries without manually messing with the DB? Since I'm basically reverting a group membership change, I thought it'd be nice if we could delete all log entries that this ever happened [20:07:48] I'm not sure if I want to make manual changes though [20:08:50] I think it ends up being a specific purpose maintenance script written [20:09:34] Daimona: do you care much about testing those two? [20:09:48] I can do a quick test [20:10:57] !log reedy@deploy2002 daimona, reedy: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [20:11:10] (03Merged) 10jenkins-bot: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett) [20:12:57] Looks good [20:13:03] !log reedy@deploy2002 daimona, reedy: Continuing with sync [20:13:14] (03CR) 10Dzahn: [C:03+1] sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway) [20:15:01] Re deleting logs, deleting by primary key seems generic enough. I just don't know if there's anything somewhere that might reference those rows [20:15:22] Are there many? [20:15:40] (03CR) 10Dzahn: [C:03+2] cyberbot: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193129 (owner: 10Dzahn) [20:17:10] (03CR) 10Dzahn: gerrit: local backup on source server only (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [20:17:17] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]] (duration: 10m 46s) [20:17:32] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [20:17:32] that was a bit laggy [20:18:03] 100 log entries already there. 
Presumably 100 more once we revert the group membership change [20:18:30] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]] [20:18:33] T406501: OATHAuth Recovery Code code improvement suggestions - https://phabricator.wikimedia.org/T406501 [20:19:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage [20:19:45] Huh, weird unserialize() spike due to some template on strategywiki... [20:20:23] (03CR) 10Dzahn: [C:03+1] gerrit: mod_qos revert to previous stable state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [20:21:42] (03CR) 10Dzahn: [C:03+1] gerrit: mod_qos revert to previous stable state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [20:22:47] (03CR) 10Dzahn: [C:03+2] gerrit: mod_qos revert to previous stable state [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb) [20:22:59] !log reedy@deploy2002 sbassett, reedy: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:23:43] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage [20:24:13] sbassett: Want to test much of this one? 
[20:24:13] I know you know this, Reedy, but the OATH patch is a no-op until we deploy the config changes [20:24:28] !log reedy@deploy2002 sbassett, reedy: Continuing with sync [20:24:30] Ha, see ^ [20:25:50] !log re-enabling QoS on gerrit servers - with previously stable config - T406774 [20:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:19] misses one of the bots that updates a ticket [20:28:49] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]] (duration: 10m 19s) [20:28:54] T406501: OATHAuth Recovery Code code improvement suggestions - https://phabricator.wikimedia.org/T406501 [20:29:37] !log re-enabled QoS on gerrit servers - with previously stable config - T406774 gerrit:1194811 [20:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:53] !log logmsgbot do you still log - test log T284123 [20:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:57] T284123: tcpircbot-logmsgbot was not able to deliver messages - https://phabricator.wikimedia.org/T284123 [20:33:43] (03PS2) 10Reedy: Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [20:33:51] here we go :) [20:34:17] (03CR) 10SBassett: [C:03+1] Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [20:34:18] sbassett: You'll have to remove your -2 ;) [20:34:25] (03CR) 10Reedy: [C:03+2] Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [20:34:32] …and done [20:35:17] 
(03Merged) 10jenkins-bot: Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett) [20:37:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 312192288 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:37:55] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]] [20:38:02] T399644: FY2025-26 WE4.6.2 Multiple Authenticators - https://phabricator.wikimedia.org/T399644 [20:38:20] So for the createAndPromote. I need to run a script on multiple wikis with different arguments for each invocation. Surely there must be a better way than invoking mwscript-k8s 100 times as the documentation says NOT to do? [20:38:52] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 202776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:39:01] 06SRE, 06Infrastructure-Foundations: nodesource node22 apt mirror is broken - https://phabricator.wikimedia.org/T406623#11262222 (10Dzahn) [20:39:02] (The lame 100-invocation way is https://phabricator.wikimedia.org/P83722#336349) [20:39:28] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927 (10Lars) 03NEW [20:40:28] Daimona: Probably not [20:41:38] Ah well [20:41:49] (03CR) 10Cwhite: [C:03+2] "Great! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1194736 (owner: 10Cwhite) [20:42:04] !log reedy@deploy2002 reedy, sbassett: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:42:05] Any SREs who could confirm and give a green light to run https://phabricator.wikimedia.org/P83722#336349 then? [20:42:24] OATH config patch has landed on the k8s-mwdebugs. So far looking good! [20:42:45] Daimona: Just do it tbh... Unless you write a script that does createandpromote for N users per wiki... [20:42:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [20:42:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11262268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host dse-k8s-worker2003.c... [20:43:37] Yeah I just wanna make sure I don't bring everything down. [20:44:15] It's more it's just not very efficient spinning up the workers for short lived jobs like this [20:44:21] (03PS1) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) [20:44:58] Yeah... Also, I'm prepared for this to take ages [20:45:06] It won't take that long tbh [20:45:14] But if you do them in smaller batches... 
[20:46:17] !log Run createAndPromote as in P83722#336349 (~100x, in series) to restore event-organizer membership # T401445 [20:46:17] sbassett: The fonts on Special:AccountSecurity look a bit weird [20:46:31] But I suspect that may be a bit RL module weirdness [20:46:34] Well, we shall see [20:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:00] T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445 [20:47:15] Reedy: Really? They seem ok in Chrome/MacOS for me? Or at least not noticeably different than mw-docker, patchdemo, beta... [20:47:24] check slack [20:48:01] It's not something I'm massively worried about though [20:48:13] >Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. [20:48:16] That may be an actual bug [20:48:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:49:12] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929 (10phaultfinder) 03NEW [20:49:27] I see a new recovery code has been DB persisted along side my TOTP [20:49:32] Reedy: so when you reload it looks ok or…? [20:50:03] (03PS7) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [20:50:07] Hooray, that’s working. I’ve tested several add/remove key, use new key workflows at this point. I’m not seeing any backend issues in Chrome/MacOS... 
[20:50:33] Nope [20:50:40] I bet that's a timeless skin issue [20:50:40] I’ve been able to use recovery codes and have them persist when a new one gets created as well. [20:50:42] I'll file a bug [20:50:52] (03PS1) 10Jdlrobson: Enable instrumentation of watchstar and other links that stopPropagation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390) [20:51:02] Yeah, has to be something at the skin or codex layer. If it looked a lot worse, I might hold off, but... [20:51:50] The UI in Vector seems as intended/expected for me. [20:51:54] Happy to continue? [20:52:16] I am, if we think minor color/font issues in some skins isn’t that big of a deal and/or can be addressed in the near future. [20:53:20] Personally, I’m far more worried about any backend/workflow issues, which I’m not seeing any for now. But that’s just MO. [20:53:34] If it was broken on vector/vector22/minerva, I'd be less happy to continue [20:53:41] Yes, agreed. [20:53:49] But that's also what it was more tested on, soo... [20:53:56] !log reedy@deploy2002 reedy, sbassett: Continuing with sync [20:54:00] jouncebot: nowandnext [20:54:00] For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000) [20:54:00] In 0 hour(s) and 5 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2100) [20:54:20] (03CR) 10Reedy: [C:03+2] Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [20:57:45] Uhoh is votewiki down? https://vote.wikimedia.org/wiki/Main_Page [20:57:57] tzatziki: Down how? 
[20:57:59] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]] (duration: 20m 04s) [20:58:03] T399644: FY2025-26 WE4.6.2 Multiple Authenticators - https://phabricator.wikimedia.org/T399644 [20:58:03] Original exception: [15fe86b3-3935-44ca-94fa-ef8e107a2e62] 2025-10-09 20:57:49: Fatal exception of type "TypeError" [20:58:24] >TypeError: MediaWiki\Extension\OATHAuth\Key\TOTPKey::__construct(): Argument #3 ($recoveryCodes) must be of type array, string given, called in /srv/mediawiki/php-1.45.0-wmf.22/extensions/OATHAuth/src/Key/TOTPKey.php on line 125 [20:58:40] Ugh [20:58:42] w [20:58:47] Why is that on votewiki [20:58:50] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:58:54] oh, this is a regression as of like two minutes ago? [20:58:58] nice [20:59:12] shall I file a bug? [20:59:19] An issue with non-SUL wikis? [20:59:40] Kinda looks like it [20:59:58] I don’t have a votewiki account, I don’t think. [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2100) [21:00:15] https://phabricator.wikimedia.org/T406933 [21:00:19] tzatziki: I think it may "only" be an issue for you logged in people... [21:00:19] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11262360 (10Ahoelzl) [21:00:21] https://phabricator.wikimedia.org/T406932 [21:00:40] oh, lol, snap. 
I'll merge my dupe [21:01:02] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11262364 (10Ahoelzl) @RLazarus sorry for the direct ping, who could help with this? [21:01:09] * Jdlrobson prepares for Web Team deployment window [21:01:15] officewiki seems to work alright... [21:01:22] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11262370 (10Ahoelzl) [21:01:26] (03PS24) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:01:32] 06SRE, 10SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11262372 (10Ahoelzl) @RLazarus sorry for the direct ping, who could help with this? [21:01:36] Collabwiki also works, ftr [21:01:39] Obviously it depends on traffic, but I'm only seeing this on vote [21:01:52] oh fucking lol [21:01:52] (03PS8) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) [21:01:53] I see why [21:01:59] No one ever ran a prior maintenance script [21:02:00] :D [21:02:04] whoops [21:02:11] so there's recovery codes that are strings with commas [21:02:15] poor urchin votewiki [21:02:21] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [21:02:24] My script is ~40 done [21:02:32] Reedy: that's, uh, how from how many years ago? [21:02:43] taavi: Did we delete that one for string to array for recovery codes? [21:02:46] Ok, so not something I broke? 
[21:02:50] I have a feeling we might have [21:02:52] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:02:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:55] sbassett: yes and no [21:02:55] let' see [21:03:09] You've coded for the latest behaviour [21:03:15] Probably removing a back compat thing along the way [21:03:21] Oh fun [21:03:27] This affects ~6 users [21:03:41] Ok, so no rollback. [21:03:42] yeah can you just disable their 2fa and I'll contact them [21:03:43] yeah, we dropped the script in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/1189123 [21:03:44] * Reedy looks at tzatziki [21:03:52] if that's a solution [21:03:53] https://github.com/wikimedia/mediawiki-extensions-OATHAuth/blob/REL1_43/maintenance/UpdateTOTPScratchTokensToArray.php [21:03:56] It's in 1.43 [21:03:58] yes I'm one of them :( lol [21:04:09] Give me a few mins [21:04:12] I'm guessing we have this in other non-votewiki private wikis? [21:04:14] tx, Reedy [21:04:21] (03PS1) 10Dzahn: zuul: add host network to docker command for new zuul-web component [puppet] - 10https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) [21:04:37] (03Merged) 10jenkins-bot: Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy) [21:04:42] I suspect I should run it on other wikis too... 
[21:04:47] (03CR) 10CI reject: [V:04-1] zuul: add host network to docker command for new zuul-web component [puppet] - 10https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [21:05:18] (03CR) 10Dzahn: [C:03+2] zuul: add host network to docker command for new zuul-web component [puppet] - 10https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [21:05:26] (03PS2) 10Dzahn: zuul: add host network to docker command for new zuul-web component [puppet] - 10https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) [21:05:47] And of course the script is still targetting oathauth_users :D [21:05:49] easily fixed [21:05:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390) (owner: 10Jdlrobson) [21:07:57] tzatziki: How about now? [21:08:00] Seeing 8 TypeErrors in logstash for this issue rn, most of which I’m guessing are from the people in this chat :) [21:08:11] ah, no, it's not fixed [21:08:16] Reedy: -- [21:08:18] well. :D [21:08:31] Yes, I've probably refreshed 8 times. lol [21:08:43] It's by no means urgent btw. I just needed to see the voter list [21:08:45] (03CR) 10Dzahn: [C:03+2] zuul: add host network to docker command for new zuul-web component [puppet] - 10https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [21:08:50] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:12] Oof, seeing some more action in logstash for votewiki, which I assume is related to Reedy trying to fix things. 
Worst-case is maybe we temp disable the new UI/module support for votewiki or private wikis? [21:11:35] tzatziki: Third time lucky [21:11:47] nope :( [21:11:47] Original exception: [64c245a2-e0c4-4223-9fa2-58555714ec99] 2025-10-09 21:11:40: Fatal exception of type "TypeError" [21:11:49] or... [21:13:06] (03Merged) 10jenkins-bot: Enable instrumentation of watchstar and other links that stopPropagation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390) (owner: 10Jdlrobson) [21:14:00] tzatziki: ok [21:14:03] *now* it is fixed [21:14:12] Reedy: wahooo [21:14:17] indeed it works now! [21:14:18] Hooray! [21:15:04] So we need to run that maint script for a few more private wikis, most likely, then? [21:15:39] anything non CA [21:15:55] Daimona: Want your other patch deploying now? :P [21:17:06] done for private [21:17:10] Yup pls [21:17:47] Script is about 85% done [21:17:49] Done for fishbowl [21:18:00] We don't have any other non SUL groups do we... [21:18:13] Shouldn’t be many [21:18:28] ah, Jdlrobson is using their window :P [21:18:53] https://noc.wikimedia.org/conf/highlight.php?file=dblists/sul.dbexpr [21:18:58] >all.dblist + preinstall.dblist - fishbowl.dblist - private.dblis [21:19:22] ahh Reedy is that why I'm seeing "here were unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.45.0-wmf.22 Continue with deployment (all patches will be deployed)? [y/N]:" ? [21:19:52] I'm not 100% sure what I should be doing now [21:20:12] I assume no is the correct answer? [21:20:24] Are you deploying more than one patch (in series)? 
[21:20:27] just one patch [21:20:33] I didn't start running mine till ~14 mins after the lock [21:20:42] and it wasn't working in my testing so I'm not sure what's happened [21:20:49] And scap hasn't checked it out [21:21:17] I am using https://spiderpig.wikimedia.org/ if you want to take a look [21:21:21] Jdlrobson: I'm pretty sure you can just continue, it's just kinda saying "hey, there's some other stuff that you might have meant to deploy" [21:22:28] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]] [21:22:39] oh... dancy about? [21:22:45] and/or thcipriani... [21:22:54] Does spiderpig just try and deploy all the things outstanding? [21:23:00] T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390 [21:23:25] (03PS1) 10Dzahn: zuul: add missing $host_ip variable to zuul-web class [puppet] - 10https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) [21:24:14] (03PS2) 10Dzahn: zuul: add missing $host_ip variable to zuul-web class [puppet] - 10https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) [21:24:30] (03CR) 10Dzahn: [C:03+2] zuul: add missing $host_ip variable to zuul-web class [puppet] - 10https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [21:24:45] That's right. If you answer yes, it deploys whatever is merged [21:24:58] marostegui@cumin1003 clone_es (PID 1943562) is awaiting input [21:25:00] Reedy: it looks like some CampaignEvents changes? 
[21:25:18] !log on db2202 cleaned up the tables I created for T400696 [21:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:23] T400696: FY25-26 WE1.4.1 RecentChanges database performance improvements - https://phabricator.wikimedia.org/T400696 [21:25:26] Yeah, I'd +2'd https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1195019 and then we had a 2FA issue on officewiki [21:25:27] My run of createAndPromote is done [21:25:33] *votewiki [21:27:36] Reedy: okay so I guess that's now being deployed? Was that tested or do we need to abort this? [21:28:30] I see I have options "Interrupt job" and "Kill job (not recommended)" but this is beyond my spiderpig training. I also need to step out in the next 10 mins for a doctor visit [21:29:12] I'd hope the master patch was tested :) [21:30:09] (03CR) 10Dzahn: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [21:32:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:34:02] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:34:06] (03PS1) 10Ryan Kemper: wdqs: bring internal-scholarly wdqs2017 into svc [puppet] - 10https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978) [21:35:04] (03CR) 10Bking: [C:03+1] wdqs: bring internal-scholarly wdqs2017 into svc 
[puppet] - 10https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978) (owner: 10Ryan Kemper) [21:35:09] (03CR) 10Ryan Kemper: [C:03+2] wdqs: bring internal-scholarly wdqs2017 into svc [puppet] - 10https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978) (owner: 10Ryan Kemper) [21:41:31] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863#11262521 (10VRiley-WMF) 05Open→03Resolved This has been fixed. Loose cable. [21:41:37] (03PS5) 10Scott French: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) [21:41:37] (03PS5) 10Scott French: rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955) [21:43:26] dancy: it's taking a lot longer than usual so it's making me a little nervous. I need to go to a doctors appointment so can I pass to you if it overruns significantly longer? [21:43:44] it's taking longer because of i18n updates [21:44:01] it's got past the docker image builds now [21:44:07] yep which I hadn't planned for :) [21:44:08] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929#11262530 (10VRiley-WMF) a:03VRiley-WMF [21:44:38] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929#11262531 (10VRiley-WMF) 05Open→03Resolved This is resolved. Loose power cable [21:47:05] ok looks like its almost done. 
I should be able to hang around and still make my appointment [21:47:54] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:47:57] T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390 [21:49:48] Yeah sorry, unfortunately I don't think it was possible to do this without the i18n updates [21:50:10] okay changes are on test servers. Mine are working [21:50:19] @Daimona can you test? [21:50:35] Apparently they only synced to test servers [21:51:24] Yep, appears to be working correctly, thank you! [21:51:27] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [21:51:29] ok syncing [21:57:43] (03PS25) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:57:53] (03CR) 10Scott French: "Many thanks to you both for the review, as well as for the tip about the rest-gateway development setup. The latter helped identify two is" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [21:59:38] ok i really have to go now [21:59:47] could someone keep an eye on spiderpig for me while I'm gone? [22:00:00] Reedy: ? [22:00:07] I'm watching, yeah [22:00:10] Sorry about this! [22:00:25] thank you and appreciate you! 
gotta run [22:00:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1018.eqiad.wmnet [22:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [22:04:06] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]] (duration: 41m 38s) [22:04:10] T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390 [22:04:54] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1020.eqiad.wmnet [22:05:08] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1019.eqiad.wmnet [22:07:43] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [22:08:50] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [22:09:02] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [22:09:35] Thanks Jon! 
[22:11:22] !log bking@wdqs10(18|19|20) systemctl start load-categories-daily.service T405978
[22:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:26] T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[22:11:32] (PS1) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:12:40] (PS2) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:13:48] (PS3) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:13:50] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:13:55] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:22:38] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1195067/7244/deploy1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:24:02] FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:28:50] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:28:50] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:28] reedy: any idea how much longer it might take?
[22:29:39] Daimona: How much longer for what?
[22:29:54] Isn't the deployment still running?
[22:29:59] It finished 25 mins ago
[22:30:15] ... You can tell it hasn't been a good day for me lol
[22:30:23] (PS1) Dzahn: httpbb: add missing directory for new zuul tests [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119)
[22:31:02] (CR) Dzahn: [C:+2] httpbb: add missing directory for new zuul tests [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:31:20] (In my defense, the SAL entry doesn't link to my patch)
[22:32:32] (CR) Dzahn: [V:+1 C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195073" [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:33:13] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[22:33:37] (CR) Dzahn: [V:+1 C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:34:51] leeeeroy jenkins
[23:01:37] marostegui@cumin1003 clone_es (PID 1943886) is awaiting input
[23:10:33] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2017.*
[23:13:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:18:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:38:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079
[23:38:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079 (owner: TrainBranchBot)
[23:51:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079 (owner: TrainBranchBot)