[00:00:27] <wikibugs>	 (Merged) jenkins-bot: mesh.configuration: Fix a typo in the OTel service_name template [deployment-charts] - https://gerrit.wikimedia.org/r/1194784 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:00:30] <wikibugs>	 (PS1) RLazarus: all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036)
[00:08:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787
[00:08:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787 (owner: TrainBranchBot)
[00:08:53] <icinga-wm>	 PROBLEM - dump of s3 in codfw on backupmon1001 is CRITICAL: dump for s3 at codfw (db2239) taken more than a week ago: Most recent backup 2025-09-30 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:14:01] <wikibugs>	 SRE, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Set up a working, usable dbt installation on stat boxes - https://phabricator.wikimedia.org/T406634#11257524 (Ahoelzl)
[00:16:51] <wikibugs>	 (CR) CI reject: [V:-1] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:19:08] <wikibugs>	 (CR) RLazarus: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:30:27] <icinga-wm>	 PROBLEM - dump of s1 in codfw on backupmon1001 is CRITICAL: dump for s1 at codfw (db2141) taken more than a week ago: Most recent backup 2025-09-30 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:45] <wikibugs>	 (CR) Scott French: [C:+1] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:34:21] <wikibugs>	 (CR) RLazarus: [C:+2] all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:44:02] <wikibugs>	 (Merged) jenkins-bot: all charts: Update mesh.configuration 1.14.2 to 1.14.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1194786 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[00:44:54] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:48:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1194787 (owner: TrainBranchBot)
[00:48:55] <rzl>	 helmfile deployments are all clear again 👍
[01:00:50] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:04:56] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799 (phaultfinder) NEW
[01:15:10] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 20s)
[01:19:57] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11257695 (phaultfinder)
[01:36:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:41:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:51:44] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1020:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[01:54:12] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:54:20] <mutante>	 !log [wdqs1020:~] $ sudo systemctl restart wdqs-blazegraph
[01:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:23] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:23] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:54:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:57:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:58:07] <icinga-wm>	 PROBLEM - dump of s4 in codfw on backupmon1001 is CRITICAL: dump for s4 at codfw (db2239) taken more than a week ago: Most recent backup 2025-09-30 01:36:23 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:58:47] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1019 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:58:47] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:59:12] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:48] <jinxer-wm>	 FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:01:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:02:17] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:02:26] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20251008
[02:04:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11257749 (phaultfinder)
[02:06:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:09:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:11:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:11:39] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release 20251008
[02:27:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:27:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:51:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:59:12] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:04:12] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:19:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:09] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:24:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:46:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:51:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:44:54] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:00:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:01:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:09:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:16:21] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1030 gradually with 4 steps - Pool es1030.eqiad.wmnet in after cloning
[05:16:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:16:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:11] <logmsgbot>	 marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[05:21:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:28:46] <logmsgbot>	 marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[05:34:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:52] <logmsgbot>	 marostegui@cumin1003 clone_es (PID 1785807) is awaiting input
[05:36:14] <wikibugs>	 (PS1) Marostegui: db2155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194806 (https://phabricator.wikimedia.org/T406541)
[05:36:33] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning
[05:37:03] <wikibugs>	 (CR) Marostegui: [C:+2] db2155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1194806 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[05:37:26] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2155.codfw.wmnet with reason: Maintenance
[05:37:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2155 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83686 and previous config saved to /var/cache/conftool/dbconfig/20251009-053730-marostegui.json
[05:40:08] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add es1050 and es1053 [puppet] - https://gerrit.wikimedia.org/r/1194808 (https://phabricator.wikimedia.org/T406488)
[05:41:04] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add es1050 and es1053 [puppet] - https://gerrit.wikimedia.org/r/1194808 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[05:41:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1050 and es1053 depooled T406488', diff saved to https://phabricator.wikimedia.org/P83687 and previous config saved to /var/cache/conftool/dbconfig/20251009-054347-marostegui.json
[05:43:51] <stashbot>	 T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[05:45:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83688 and previous config saved to /var/cache/conftool/dbconfig/20251009-054548-root.json
[05:46:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:47:12] <wikibugs>	 (PS1) Marostegui: installserver: Remove es1051 [puppet] - https://gerrit.wikimedia.org/r/1194812 (https://phabricator.wikimedia.org/T406488)
[05:51:17] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Remove es1051 [puppet] - https://gerrit.wikimedia.org/r/1194812 (https://phabricator.wikimedia.org/T406488) (owner: Marostegui)
[05:51:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:52:49] <wikibugs>	 (PS1) Arnaudb: gerrit: mod_qos revert to previous stable state [puppet] - https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403)
[05:59:48] <jinxer-wm>	 FIRING: [2x] PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0600).
[06:00:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83691 and previous config saved to /var/cache/conftool/dbconfig/20251009-060054-root.json
[06:01:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:01:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1030 gradually with 4 steps - Pool es1030.eqiad.wmnet in after cloning
[06:01:51] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1030.eqiad.wmnet onto es1053.eqiad.wmnet
[06:02:32] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:09:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:09:54] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: mod_qos tweaks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1193597 (https://phabricator.wikimedia.org/T406403) (owner: Arnaudb)
[06:11:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: wdqs-blazegraph.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:15:33] <wikibugs>	 (CR) Arnaudb: gerrit: increase QS_ClientPrefer threshold (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1194702 (owner: Dzahn)
[06:16:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83694 and previous config saved to /var/cache/conftool/dbconfig/20251009-061600-root.json
[06:22:02] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1027 gradually with 4 steps - Pool es1027.eqiad.wmnet in after cloning
[06:22:02] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1027.eqiad.wmnet onto es1050.eqiad.wmnet
[06:26:19] <logmsgbot>	 !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host
[06:26:33] <logmsgbot>	 !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host (duration: 00m 14s)
[06:26:43] <logmsgbot>	 !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host
[06:26:56] <logmsgbot>	 !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@fea7794]: deploy to fresh wdqs-main host (duration: 00m 13s)
[06:27:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:27:23] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:23] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:23] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs1018 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:25] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:27] <stashbot>	 T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[06:27:35] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:47] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs1019 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs1019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:27:47] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1019 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:28:36] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1020.eqiad.wmnet -> wdqs1019.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[06:28:49] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:28:58] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=wdqs1019.*
[06:31:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83696 and previous config saved to /var/cache/conftool/dbconfig/20251009-063106-root.json
[06:36:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] site.pp: reimage all hcaptcha nodes to role [puppet] - https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) (owner: Ssingh)
[06:36:57] <wikibugs>	 (CR) Muehlenhoff: [C:+1] conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) (owner: Ssingh)
[06:39:24] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[06:48:34] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[06:50:32] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1019.eqiad.wmnet with OS bullseye
[06:53:01] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[06:58:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good to me. The new FastFloat lib isn't marked as PIC, whole folly uses it which itself does use PIC? But if the build result works" [debs/mcrouter] - https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) (owner: Andrew Bogott)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0700).
[07:00:04] <jouncebot>	 lmora: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:07] <wikibugs>	 (Abandoned) Slyngshede: P:cache::haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1184497 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:00:58] <kostajh>	 I'm adding a patch to the window 
[07:01:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: Kosta Harlan)
[07:01:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: Kosta Harlan)
[07:02:42] <wikibugs>	 (CR) Kosta Harlan: Add ReadingList Stream to EventStreamConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: LorenMora)
[07:03:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: Kosta Harlan)
[07:04:40] <wikibugs>	 (PS13) Slyngshede: P:cache::haproxy copy private repo data [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[07:05:54] <moritzm>	 !log installing Redis security updates
[07:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:52] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[07:14:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1034 and es1029 T406488', diff saved to https://phabricator.wikimedia.org/P83697 and previous config saved to /var/cache/conftool/dbconfig/20251009-071430-marostegui.json
[07:14:35] <stashbot>	 T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488
[07:15:53] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1029,1034].eqiad.wmnet with reason: Cloning
[07:16:09] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type [extensions/ConfirmEdit] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194666 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan)
[07:17:17] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]]
[07:17:19] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize es1052 [puppet] - 10https://gerrit.wikimedia.org/r/1194822 (https://phabricator.wikimedia.org/T406488)
[07:17:20] <stashbot>	 T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204
[07:18:22] <wikibugs>	 (03PS14) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[07:18:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 (owner: 10Bearloga)
[07:19:57] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1052 [puppet] - 10https://gerrit.wikimedia.org/r/1194822 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui)
[07:20:10] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1017.eqiad.wmnet -> wdqs1018.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[07:20:14] <stashbot>	 T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[07:20:31] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[07:20:54] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T405978, transfer to freshly reimaged host) xfer wikidata_main from wdqs1020.eqiad.wmnet -> wdqs1019.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[07:21:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1019:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:22:12] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:24:12] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1029.eqiad.wmnet onto es1052.eqiad.wmnet
[07:24:12] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:17] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:25:06] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[07:25:36] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1194823 (https://phabricator.wikimedia.org/T406488)
[07:27:11] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1057 [puppet] - 10https://gerrit.wikimedia.org/r/1194823 (https://phabricator.wikimedia.org/T406488) (owner: 10Marostegui)
[07:29:11] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194666|hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type (T404204)]] (duration: 11m 54s)
[07:29:15] <stashbot>	 T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204
[07:31:08] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.clone_es of es1034.eqiad.wmnet onto es1057.eqiad.wmnet
[07:32:55] <kostajh>	 On to the next ones 
[07:33:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: 10Kosta Harlan)
[07:33:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 (owner: 10Bearloga)
[07:34:39] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] tests: Remove usage of ReflectionProperty::setAccessible(), no-op [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194749 (https://phabricator.wikimedia.org/T406744) (owner: 10Jforrester)
[07:35:07] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: Fix user-agent exclusion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192533 (https://phabricator.wikimedia.org/T387600) (owner: 10Kosta Harlan)
[07:35:10] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: fix IP auto reveal stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194733 (owner: 10Bearloga)
[07:35:43] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]]
[07:35:46] <stashbot>	 T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600
[07:36:02] <wikibugs>	 (03Merged) 10jenkins-bot: tests: Remove usage of ReflectionProperty::setAccessible(), no-op [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194749 (https://phabricator.wikimedia.org/T406744) (owner: 10Jforrester)
[07:39:01] <wikibugs>	 (03PS1) 10Marostegui: db2147: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194824 (https://phabricator.wikimedia.org/T406541)
[07:40:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2147: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194824 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui)
[07:40:47] <logmsgbot>	 !log kharlan@deploy2002 kharlan, bearloga: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:40:50] <stashbot>	 T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600
[07:40:52] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2147.codfw.wmnet with reason: Maintenance
[07:40:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2147 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83698 and previous config saved to /var/cache/conftool/dbconfig/20251009-074055-marostegui.json
[07:42:46] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1018.eqiad.wmnet with OS bullseye
[07:43:40] <logmsgbot>	 !log kharlan@deploy2002 kharlan, bearloga: Continuing with sync
[07:44:12] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs1018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:47:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr2-eqiad: fan failure on left tray [Oct 2025] - https://phabricator.wikimedia.org/T406554#11258314 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr
[07:47:36] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192533|EventStreamConfig: Fix user-agent exclusion config (T387600)]], [[gerrit:1194733|EventStreamConfig: fix IP auto reveal stream]] (duration: 11m 53s)
[07:47:40] <stashbot>	 T387600: IP Auto-reveal: Agree and implement metrics and instrumentation plan - https://phabricator.wikimedia.org/T387600
[07:48:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan)
[07:49:02] <logmsgbot>	 jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[07:49:12] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan)
[07:49:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83699 and previous config saved to /var/cache/conftool/dbconfig/20251009-074914-root.json
[07:49:43] <wikibugs>	 (03Merged) 10jenkins-bot: ConfirmEdit/hCaptcha: Implement automatic failover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194671 (https://phabricator.wikimedia.org/T404204) (owner: 10Kosta Harlan)
[07:51:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:51:32] <kostajh>	 I am seeing "unexpected commits pulled from origin for /srv/mediawiki-staging" 
[07:51:38] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@af75327] (hadoop-test): Analytics deploy - druid pageviews_daily - TEST [analytics/refinery@af753272]
[07:51:47] <kostajh>	 hashar: do you have some advice? 
[07:52:21] <hashar>	 I have not even had my coffee yet! :D
[07:52:30] <hashar>	 I don't know what that means. Is that in the SpiderPig output?
[07:52:32] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@af75327] (hadoop-test): Analytics deploy - druid pageviews_daily - TEST [analytics/refinery@af753272] (duration: 00m 54s)
[07:52:38] <kostajh>	 yeah
[07:52:44] <kostajh>	 oh it looks to be from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194749
[07:52:45] <kostajh>	 so it's fine
[07:53:00] <kostajh>	 proceeding
[07:53:05] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@af75327]: Analytics deploy - druid pageviews_daily  [analytics/refinery@af753272]
[07:53:17] <hashar>	 hmm
[07:53:19] <logmsgbot>	 !log kharlan@deploy2002 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /usr/local/bin/update-mediawiki-tools-release' returned non-zero exit status 1. (scap version: 4.213.0) (duration: 00m 00s)
[07:53:32] <hashar>	 so yeah that change has not been deployed
[07:53:34] <kostajh>	 `fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/release.git/': The requested URL returned error: 502` 
[07:53:37] <kostajh>	 hmm
[07:53:51] <hashar>	 then again, I thought scap was smart enough to detect that a change was a no-op (because it only touches beta or tests)
[07:54:12] <kostajh>	 I'm trying again, assuming the gitlab access issue was transient 
[07:54:34] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]]
[07:54:35] <hashar>	 or maybe the skip only happens when one attempts to backport said patch (e.g. `scap backport 1194749`)
[07:54:37] <stashbot>	 T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204
[07:56:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-1] "Overall LGTM: change the management of the lua private files directory as I suggested and you get my +1." [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[07:56:58] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@af75327]: Analytics deploy - druid pageviews_daily  [analytics/refinery@af753272] (duration: 03m 53s)
[07:57:15] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@af75327] (thin): Analytics deploy - druid pageviews_daily - THIN [analytics/refinery@af753272]
[07:57:20] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[07:57:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:27] <wikibugs>	 (03PS1) 10Slyngshede: P:idp prepare for new Trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455)
[07:58:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-1] "Sorry, re-thinking about it, I found a bigger issue: the routines called in the private repo can be quite expensive, so I'd only run that " [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[07:59:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565)
[07:59:09] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:59:25] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@af75327] (thin): Analytics deploy - druid pageviews_daily - THIN [analytics/refinery@af753272] (duration: 02m 10s)
[08:00:05] <jouncebot>	 jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0800)
[08:00:46] <jnuche>	 morning, it looks like there's a backport still going on, so holding the train for now
[08:00:49] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: fix ssd upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1193818 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[08:00:57] <wikibugs>	 (03CR) 10Marostegui: "It looks good to me, do you want a host to test? I can give you one from s4 as I am currently doing that section" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto)
[08:02:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:27] <wikibugs>	 (03PS15) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[08:03:42] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[08:04:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83700 and previous config saved to /var/cache/conftool/dbconfig/20251009-080420-root.json
[08:05:51] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[08:07:48] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194671|ConfirmEdit/hCaptcha: Implement automatic failover (T404204)]] (duration: 13m 14s)
[08:07:51] <stashbot>	 T404204: Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204
[08:09:34] <hashar>	 jnuche: good morning! It seems like kostajh's backport has completed
[08:09:46] <kostajh>	 I'm done 
[08:09:53] <hashar>	 content transformer reverted the patch that inserted some `<i>` in the table of contents :-)
[08:10:03] <jnuche>	 good morning, ack, I'll roll out the train in a few minutes
[08:10:21] <hashar>	 and Subbu was super happy doing it over SpiderPig
[08:10:34] <jnuche>	 nice
[08:11:39] <wikibugs>	 (03PS16) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[08:12:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044']
[08:12:47] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2044']
[08:13:01] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:13:34] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678)
[08:13:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot)
[08:14:32] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194875 (https://phabricator.wikimedia.org/T405678) (owner: 10TrainBranchBot)
[08:14:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove maps roles from maps-test* [puppet] - 10https://gerrit.wikimedia.org/r/1194842 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[08:15:46] <wikibugs>	 (03PS17) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[08:15:56] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:18:12] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2043.codfw.wmnet']
[08:18:36] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2043.codfw.wmnet']
[08:18:42] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:19:14] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet']
[08:19:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83701 and previous config saved to /var/cache/conftool/dbconfig/20251009-081926-root.json
[08:19:41] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet']
[08:19:45] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7238/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:19:56] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:20:08] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] P:cache::haproxy copy private repo data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:20:13] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:22:58] <logmsgbot>	 !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.22  refs T405678
[08:23:04] <stashbot>	 T405678: 1.45.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T405678
[08:26:19] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[08:26:57] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[08:27:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11258496 (10cmooney) >>! In T404959#11255300, @VRiley-WMF wrote: > Okay, was looking at this issue a bit. There are currently two fiber...
[08:31:31] <wikibugs>	 (03PS1) 10Elukey: sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876
[08:31:46] <wikibugs>	 (03PS18) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[08:34:29] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:34:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83702 and previous config saved to /var/cache/conftool/dbconfig/20251009-083432-root.json
[08:36:32] <wikibugs>	 (03PS1) 10Marostegui: db2179: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194877 (https://phabricator.wikimedia.org/T406541)
[08:37:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2179: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194877 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui)
[08:37:57] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2179.codfw.wmnet with reason: Maintenance
[08:38:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2179 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83703 and previous config saved to /var/cache/conftool/dbconfig/20251009-083801-marostegui.json
[08:39:49] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806)
[08:41:25] <wikibugs>	 (03PS2) 10Elukey: sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876
[08:41:35] <wikibugs>	 (03Abandoned) 10Federico Ceratto: instances.yaml: Add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1184091 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[08:42:22] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06collaboration-services, 10Data-Persistence-Backup: Evaluate generic backup tooling for object storage buckets - https://phabricator.wikimedia.org/T406824 (10Jelto) 03NEW
[08:42:42] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11258575 (10Jelto)
[08:43:16] <wikibugs>	 (03PS2) 10Elukey: admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806)
[08:44:29] <wikibugs>	 (03CR) 10Marostegui: "I've been using this with a few hosts and works very nicely, what's pending to get it merged?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto)
[08:44:54] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:44:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:46:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83704 and previous config saved to /var/cache/conftool/dbconfig/20251009-084614-root.json
[08:48:12] <wikibugs>	 (03PS1) 10Elukey: preseed: set ms-be2078 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356)
[08:48:22] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11258584 (10Jelto) 05Open→03Resolved Great, then I'll resolve this task. I opened {T406824} as a follow up to track the obj...
[08:48:38] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948)
[08:48:40] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948)
[08:48:51] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:49:43] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: add comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1194876 (owner: 10Elukey)
[08:50:23] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948)
[08:50:26] <wikibugs>	 (03PS5) 10Federico Ceratto: sanitize-wiki.py: Improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1191689 (https://phabricator.wikimedia.org/T366146)
[08:51:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sanitize-wiki.py: Improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1191689 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto)
[08:52:15] <logmsgbot>	 !log elukey@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp2050.codfw.wmnet']
[08:52:21] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2050.codfw.wmnet']
[08:52:34] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2050.codfw.wmnet']
[08:53:09] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet']
[08:53:55] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet']
[08:54:49] <wikibugs>	 (03CR) 10Federico Ceratto: "It's pending review if someone wants to take the time ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto)
[09:01:03] <wikibugs>	 (03PS1) 10Elukey: sre.hardware.upgrade-firmware: use lower when matching firmware versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851)
[09:01:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83705 and previous config saved to /var/cache/conftool/dbconfig/20251009-090120-root.json
[09:03:36] <wikibugs>	 (03PS1) 10Marostegui: db1252: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194884 (https://phabricator.wikimedia.org/T406541)
[09:04:15] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1252: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1194884 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui)
[09:04:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565)
[09:05:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[09:05:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1252 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83706 and previous config saved to /var/cache/conftool/dbconfig/20251009-090516-marostegui.json
[09:06:17] <wikibugs>	 (03CR) 10Elukey: "prerequisite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194639" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey)
[09:10:50] <wikibugs>	 (03PS2) 10Federico Ceratto: preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859)
[09:10:50] <wikibugs>	 (03PS2) 10Federico Ceratto: es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859)
[09:12:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:13:20] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887
[09:13:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83707 and previous config saved to /var/cache/conftool/dbconfig/20251009-091322-root.json
[09:13:30] <wikibugs>	 (03CR) 10Marostegui: clone_es.py: clone readonly es* hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto)
[09:14:16] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7239/co" [puppet] - 10https://gerrit.wikimedia.org/r/1194887 (owner: 10Majavah)
[09:15:13] <wikibugs>	 (03CR) 10Klausman: [C:03+1] profile::amd_gpu: add initial support for the k8s node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey)
[09:15:19] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey)
[09:16:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83708 and previous config saved to /var/cache/conftool/dbconfig/20251009-091626-root.json
[09:17:09] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[09:17:12] <kostajh>	 jouncebot: nowandnext
[09:17:12] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T0800)
[09:17:12] <jouncebot>	 In 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000)
[09:17:24] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: use lower when matching firmware versions [cookbooks] - 10https://gerrit.wikimedia.org/r/1194883 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[09:17:40] <kostajh>	 jnuche / hashar can I backport the patch for T406707?
[09:17:40] <stashbot>	 T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707
[09:18:03] <wikibugs>	 (03PS1) 10Kosta Harlan: Check against correct key in sortEntitiesByTimestamp [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707)
[09:18:34] <jnuche>	 kostajh: yeah, fine by me
[09:18:59] <kostajh>	 ok, I will start that then
[09:19:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707) (owner: 10Kosta Harlan)
[09:21:03] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2044.codfw.wmnet']
[09:21:19] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2044.codfw.wmnet']
[09:22:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[09:22:34] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add banned IP list [puppet] - 10https://gerrit.wikimedia.org/r/1194881 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[09:22:42] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[09:23:05] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "nice job" [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur)
[09:23:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[09:23:38] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948)
[09:23:38] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2045.codfw.wmnet']
[09:23:56] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2045.codfw.wmnet']
[09:24:03] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet']
[09:24:03] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887
[09:25:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps replicas to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565)
[09:26:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Syncronise the Hiera settings for the bookworm maps replicas to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:26:51] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Add per-IP rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1194882 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah)
[09:27:17] <wikibugs>	 (03PS2) 10Muehlenhoff: Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565)
[09:27:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356) (owner: 10Elukey)
[09:28:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83709 and previous config saved to /var/cache/conftool/dbconfig/20251009-092827-root.json
[09:28:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede)
[09:29:33] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: Use default config template [puppet] - 10https://gerrit.wikimedia.org/r/1194887 (owner: 10Majavah)
[09:29:37] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:30:09] <wikibugs>	 (03CR) 10Elukey: [C:03+2] preseed: set ms-be2078 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1194880 (https://phabricator.wikimedia.org/T404356) (owner: 10Elukey)
[09:30:36] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2046.codfw.wmnet']
[09:31:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db2179 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83711 and previous config saved to /var/cache/conftool/dbconfig/20251009-093131-root.json
[09:32:35] <wikibugs>	 (03Merged) 10jenkins-bot: Check against correct key in sortEntitiesByTimestamp [extensions/CheckUser] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194890 (https://phabricator.wikimedia.org/T406707) (owner: 10Kosta Harlan)
[09:32:52] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1194890|Check against correct key in sortEntitiesByTimestamp (T406707)]]
[09:32:56] <stashbot>	 T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707
[09:34:05] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892
[09:35:42] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:35:46] <wikibugs>	 (03PS2) 10Tiziano Fogli: Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:36:36] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[09:36:39] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1194890|Check against correct key in sortEntitiesByTimestamp (T406707)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:37:10] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[09:37:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[09:38:18] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:38:19] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[09:38:50] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:39:20] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[09:39:58] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[09:40:07] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] check_gdnsd_checkconf: enable nrpe wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1184469 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli)
[09:40:29] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[09:41:40] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:42:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for the metamonitoring endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1194578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:43:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83713 and previous config saved to /var/cache/conftool/dbconfig/20251009-094333-root.json
[09:44:11] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194890|Check against correct key in sortEntitiesByTimestamp (T406707)]] (duration: 11m 18s)
[09:44:14] <stashbot>	 T406707: PHP Warning: Undefined array key 20250928162850 - https://phabricator.wikimedia.org/T406707
[09:48:49] <wikibugs>	 (03PS1) 10Majavah: haproxy: Remove separate cloud::base class [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558)
[09:50:22] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7240/" [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[09:52:13] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "seems like zuul-haproxy-01.zuul.eqiad1.wikimedia.cloud uses a custom puppetserver which is not hooked up to PCC?" [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[09:52:14] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz)
[09:52:54] <wikibugs>	 (03PS5) 10Aaron Schulz: Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203)
[09:53:56] <wikibugs>	 (03CR) 10Clément Goubert: "Small change needed since we introduced fractional routing" [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz)
[09:54:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "This looks good, the same change was already applied for the new EPP template used on Trixie and later (see comment inline)." [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway)
[09:55:46] <wikibugs>	 (03PS3) 10Hnowlan: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French)
[09:55:46] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "This is so much neater than I expected it to be, thank you for doing it!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French)
[09:58:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'db1252 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83715 and previous config saved to /var/cache/conftool/dbconfig/20251009-095839-root.json
[09:59:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Just checked on Bullseye: even if someone uses SSH 8.4 (the version shipped in Bullseye), the NIST variants aren't in the default value of" [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway)
[09:59:56] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (owner: 10Scott French)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000)
[10:01:29] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:01:35] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:02:12] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:04:08] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz)
[10:04:27] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] K8s reverse DNS delegation: remove wikikube-ctrl1001 and add new nets [dns] - 10https://gerrit.wikimedia.org/r/1194678 (https://phabricator.wikimedia.org/T383227) (owner: 10Cathal Mooney)
[10:04:45] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[10:05:33] <icinga-wm>	 PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:35] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[10:05:58] <wikibugs>	 (03Merged) 10jenkins-bot: Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz)
[10:08:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:08:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:08:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[10:09:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[10:09:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[10:09:20] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] haproxy: Remove separate cloud::base class [puppet] - 10https://gerrit.wikimedia.org/r/1194895 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah)
[10:09:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:09:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:10:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:11:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:11:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:11:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:12:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:15:21] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:17:49] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:17:55] <wikibugs>	 10SRE-SLO, 10EditCheck, 06Editing-team (Kanban Board), 07Essential-Work, 05Goal: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11258956 (10elukey) 05Open→03Resolved Closed the task in favor of T406836, since the work is done :)
[10:20:08] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:20:19] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::amd_gpu: add initial support for the k8s node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1194639 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey)
[10:20:36] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:24:12] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:26] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: add the amdgpu-node-labeller clusterrole as optional RBAC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194878 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey)
[10:26:44] <hnowlan>	 jouncebot: nowandnext
[10:26:44] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1000)
[10:26:44] <jouncebot>	 In 1 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[10:26:44] <jouncebot>	 In 1 hour(s) and 33 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[10:27:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:27:11] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318)
[10:28:42] <logmsgbot>	 elukey@cumin1003 provision (PID 1961053) is awaiting input
[10:29:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:29:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:30:20] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:33:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:34:20] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp prepare for new Trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1194829 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede)
[10:37:51] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:38:26] <moritzm>	 elukey: your patch leads to wide-spread Puppet failures on Hadoop nodes:
[10:38:34] <moritzm>	 Function lookup() did not find a value for the name 'profile::kubernetes::cluster_name'
[10:38:42] <moritzm>	 in /srv/puppet_code/environments/production/modules/profile/manifests/amd_gpu.pp, line: 4
[10:42:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix Hiera lookups for the node labeller on non ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1194909 (https://phabricator.wikimedia.org/T373806)
[10:42:46] <moritzm>	 ^ should fix it
[10:42:53] <wikibugs>	 (03CR) 10Fabfur: haproxy: try to parse also non utf8 characters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur)
[10:43:43] <wikibugs>	 (03PS3) 10Fabfur: haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427)
[10:46:26] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] K8s reverse DNS delegation: remove wikikube-ctrl1001 and add new nets [dns] - 10https://gerrit.wikimedia.org/r/1194678 (https://phabricator.wikimedia.org/T383227) (owner: 10Cathal Mooney)
[10:46:50] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[10:47:09] <elukey>	 moritzm: ah snap! Lemme check
[10:47:45] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[10:47:54] <moritzm>	 elukey: AFAICT https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194909 should fix it
[10:49:39] <elukey>	 moritzm: the problem IIUC is that String $kubernetes_cluster_name = lookup('profile::kubernetes::cluster_name'), is always executed, and it doesn't have a default value (my bad of course)
[10:49:56] <elukey>	 adding a default to undef should fix the problem, wdyt?
[10:50:49] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[10:51:44] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur)
[10:51:45] <moritzm>	 ah, indeed. given that $kubernetes_cluster_name isn't used outside of the $enable_node_labeller conditional that should be fine
[10:52:49] <wikibugs>	 (03PS1) 10Elukey: profile::amd_gpu: relax hiera lookup for the kubernetes cluster metadata [puppet] - 10https://gerrit.wikimedia.org/r/1194910
[10:52:52] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb)
[10:52:54] <elukey>	 moritzm: --^
[10:53:08] <elukey>	 totally my bad, I didn't run pcc on hadoop
[10:53:14] <elukey>	 I always forget we have gpus in there too
[10:54:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1194910 (owner: 10Elukey)
[10:54:44] <elukey>	 thanks for the ping and sorry for the noise
[10:54:45] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Fix Hiera lookups for the node labeller on non ML roles [puppet] - 10https://gerrit.wikimedia.org/r/1194909 (https://phabricator.wikimedia.org/T373806) (owner: 10Muehlenhoff)
[10:55:19] <moritzm>	 np, I was glad it wasn't a terrible issue with the Puppet servers instead :-)
[10:55:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194596 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert)
[10:56:26] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::amd_gpu: relax hiera lookup for the kubernetes cluster metadata [puppet] - 10https://gerrit.wikimedia.org/r/1194910 (owner: 10Elukey)
[10:57:14] <wikibugs>	 (03Abandoned) 10Btullis: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908) (owner: 10Nicolas Fraison)
[10:57:48] <moritzm>	 !log installing qemu security updates
[10:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:45] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:03:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Syncronise the Hiera settings for the bookworm maps masters to the role settings [puppet] - 10https://gerrit.wikimedia.org/r/1194886 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[11:08:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Syncronise the Hiera settings for the bookworm maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/1194891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[11:10:01] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[11:10:09] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[11:12:35] <wikibugs>	 (03PS2) 10Ladsgroup: maintain-views: Add abusefilterblockeddomainhit to allowed log types [puppet] - 10https://gerrit.wikimedia.org/r/1194294 (https://phabricator.wikimedia.org/T406562)
[11:12:40] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Add abusefilterblockeddomainhit to allowed log types [puppet] - 10https://gerrit.wikimedia.org/r/1194294 (https://phabricator.wikimedia.org/T406562) (owner: 10Ladsgroup)
[11:13:17] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp1005.wikimedia.org
[11:13:19] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox
[11:14:14] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views
[11:16:49] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1005.wikimedia.org - slyngshede@cumin1003"
[11:18:14] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp1005.wikimedia.org - slyngshede@cumin1003"
[11:18:14] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:18:15] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp1005.wikimedia.org on all recursors
[11:18:18] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp1005.wikimedia.org on all recursors
[11:18:44] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1005.wikimedia.org - slyngshede@cumin1003"
[11:18:48] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp1005.wikimedia.org - slyngshede@cumin1003"
[11:20:22] <wikibugs>	 (03PS4) 10Revi: kowikisource: Add "해석" namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193521 (https://phabricator.wikimedia.org/T406405)
[11:20:54] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp1005.wikimedia.org with OS trixie
[11:21:09] <logmsgbot>	 !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
[11:21:40] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views
[11:23:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:27:56] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:28:29] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318)
[11:30:44] <wikibugs>	 (03PS1) 10Muehlenhoff: installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487)
[11:32:43] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp1005.wikimedia.org with reason: host reimage
[11:32:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:35:31] <wikibugs>	 (03PS2) 10Muehlenhoff: installserver: Drop support for legacy startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487)
[11:35:56] <Amir1>	 jouncebot: nowandnext
[11:35:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 24 minute(s)
[11:35:56] <jouncebot>	 In 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[11:35:56] <jouncebot>	 In 0 hour(s) and 24 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[11:36:24] <Amir1>	 okay, I'll wait until gerrit maint is over
[11:36:53] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp1005.wikimedia.org with reason: host reimage
[11:38:09] <wikibugs>	 (03PS1) 10Revi: kowiki: Restrict move ratelimit for non-extendedconfirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849)
[11:38:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:39:17] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:39:17] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[11:39:17] <jouncebot>	 In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[11:39:17] <jouncebot>	 In 0 hour(s) and 20 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[11:40:04] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.reboot-single for host dbprov2007.codfw.wmnet
[11:40:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194916 (https://phabricator.wikimedia.org/T406849) (owner: 10Revi)
[11:41:43] <wikibugs>	 (03PS2) 10Arnaudb: Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833)
[11:41:43] <wikibugs>	 (03PS2) 10Arnaudb: Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833)
[11:43:28] <wikibugs>	 (03PS3) 10Muehlenhoff: installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487)
[11:45:03] <wikibugs>	 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259418 (10k...
[11:45:32] <wikibugs>	 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259425 (10k...
[11:45:45] <wikibugs>	 06SRE, 05MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), 06Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), 05WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): Investigate options for automatic fallback to FancyCAPTCHA - https://phabricator.wikimedia.org/T404204#11259426 (10k...
[11:46:15] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dbprov2007.codfw.wmnet
[11:47:03] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:51:44] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:53:09] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp1005.wikimedia.org with OS trixie
[11:53:09] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp1005.wikimedia.org
[11:53:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406799#11259462 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[11:53:20] <fabfur>	 !log disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194676 on cp5021 (T404427) 
[11:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:29] <wikibugs>	 (03CR) 10Muehlenhoff: "(PCC failure is expected, the  install servers use Puppet7 syntax)" [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:53:40] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis)
[11:54:34] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[11:55:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259467 (10Jclark-ctr)
[11:55:29] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy: try to parse also non utf8 characters [puppet] - 10https://gerrit.wikimedia.org/r/1194676 (https://phabricator.wikimedia.org/T404427) (owner: 10Fabfur)
[11:56:05] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, we should keep in mind to enable the backups  I3dc22ca2c6eb89aade24601cf3699c51faf47fd6" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[11:56:36] <wikibugs>	 (03PS1) 10Muehlenhoff: atftpd: Drop service definition [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487)
[11:57:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259470 (10Jclark-ctr)
[11:57:44] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:57:54] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:59:10] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11259473 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[11:59:33] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#11259477 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF
[11:59:42] <moritzm>	 !log installing luajit security updates
[11:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[12:00:05] <jouncebot>	 arnaudb and hashar: Deploy window Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[12:00:16] <arnaudb>	 yep
[12:01:48] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert^2 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1193845 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:01:57] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert^2 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1193846 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:02:05] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[12:03:37] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003)
[12:03:41] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003)
[12:03:49] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org
[12:04:31] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit1003.wikimedia.org
[12:04:35] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit1003.wikimedia.org
[12:05:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259508 (10Jclark-ctr) Replacement drive has arrived
[12:06:45] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gerrit.read-only-toggle from gerrit2003.wikimedia.org
[12:07:13] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.read-only-toggle (exit_code=0) from gerrit2003.wikimedia.org
[12:07:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11259511 (10MoritzMuehlenhoff)
[12:10:07] <fabfur>	 !log enable puppet on A:cp-eqsin to deploy https://gerrit.wikimedia.org/r/1194676 (T404427)
[12:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:42] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:12:54] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:13:20] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:13:20] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:13:22] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:13:26] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.ganeti.makevm for new host idp2005.wikimedia.org
[12:13:27] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox
[12:14:12] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:14:16] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:14:42] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:14:58] <jelto>	 ^ I silenced this for the gerrit switchover for one hour
[12:15:14] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:06] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:06] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:08] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:08] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:22] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:16:36] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:36] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:38] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:16:54] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:17:18] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2005.wikimedia.org - slyngshede@cumin1003"
[12:17:49] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp2005.wikimedia.org - slyngshede@cumin1003"
[12:17:50] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:17:50] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.wipe-cache idp2005.wikimedia.org on all recursors
[12:17:53] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp2005.wikimedia.org on all recursors
[12:18:24] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2005.wikimedia.org - slyngshede@cumin1003"
[12:18:28] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp2005.wikimedia.org - slyngshede@cumin1003"
[12:18:44] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:18:48] <fabfur>	 !log reloading haproxy on A:cp-eqsin (T404427)
[12:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:54] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host idp2005.wikimedia.org with OS trixie
[12:20:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:21:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye
[12:21:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:14] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:25:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:31:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259564 (10BTullis) I can see from `dmesg -T` that the drive in question is `/dev/sde` It was remounted read-only on Oct 3rd. ` [Fri Oct  3 02:12:46 2025]...
[12:36:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[12:37:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11259596 (10BTullis) OK, this is a software RAID10 volume. It looks like it is drive `/dev/sde` that has failed. ` btullis@druid1011:~$ cat...
[12:39:26] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on idp2005.wikimedia.org with reason: host reimage
[12:44:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[12:44:54] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:46:02] <icinga-wm>	 PROBLEM - SSH on cumin1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:47:44] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[12:47:58] <icinga-wm>	 RECOVERY - SSH on cumin1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:49:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:36] <elukey>	 jouncebot: nowandnext
[12:49:36] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[12:49:36] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Gerrit/Operations#Switch_over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1200)
[12:49:36] <jouncebot>	 In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1300)
[12:50:31] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp2005.wikimedia.org with reason: host reimage
[12:51:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:16] <wikibugs>	 (03CR) 10Arnaudb: [V:03+2 C:03+2] "slow rsync for local backup issue" [puppet] - 10https://gerrit.wikimedia.org/r/1194926 (owner: 10Arnaudb)
[12:53:27] <wikibugs>	 (03CR) 10Arnaudb: [V:03+2 C:03+2] "slow rsync for local backup issue" [dns] - 10https://gerrit.wikimedia.org/r/1194927 (owner: 10Arnaudb)
[12:53:31] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[12:53:40] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[12:53:53] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[12:54:12] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:54:42] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:55:03] <logmsgbot>	 !log arnaudb@dns1004 END - running authdns-update
[12:55:16] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:55:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] Cloudcephosd1050: Configure ceph with a single nic [puppet] - 10https://gerrit.wikimedia.org/r/1191086 (https://phabricator.wikimedia.org/T405478) (owner: 10Andrew Bogott)
[12:56:06] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:08] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:10] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:10] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:10] <icinga-wm>	 RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1235 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[12:56:22] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:56:36] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:36] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:40] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:56:54] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:57:54] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:58:12] <fabfur>	 !log enable puppet on A:cp to deploy https://gerrit.wikimedia.org/r/1194676 (T404427)
[12:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:22] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:58:22] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:58:22] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:58:23] <wikibugs>	 (03CR) 10Federico Ceratto: "Yes please. Also I'm planning to add safety checks based on your comment above." [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto)
[12:59:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11259678 (10Jclark-ctr) @BTullis Failed drive has been replaced!  Thanks for the assistance
[12:59:16] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[12:59:31] <Dreamy_Jazz>	 Nothing seems to be running in https://integration.wikimedia.org/zuul/. Is that currently expected?
[12:59:59] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1300)
[13:00:05] <jouncebot>	 edsanders, Daimona, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <Lucas_WMDE>	 Dreamy_Jazz: sounds relatively expected to me, from what I could tell Gerrit only became available again a few minutes ago
[13:00:22] <Lucas_WMDE>	 I don’t know if the maintenance is considered over or not
[13:00:32] <Dreamy_Jazz>	 Yeah, I was asking for the window
[13:00:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye
[13:00:39] <Dreamy_Jazz>	 Given that there are patches to go through zuul
[13:01:00] <Daimona>	 o/
[13:01:07] * Lucas_WMDE sees Dreamy_Jazz already resubmitted https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/1192178 and closes the tab
[13:01:22] <Dreamy_Jazz>	 Well it isn't going currently :D
[13:01:30] <Lucas_WMDE>	 I see
[13:01:48] <Lucas_WMDE>	 anyway, I have a meeting in 15 minutes so I’m not the best person to run today’s backport window anyway… anyone else? ^^
[13:02:16] <Dreamy_Jazz>	 We probably can't run it until zuul is working again
[13:02:33] <Dreamy_Jazz>	 Though my one doesn't need to go through gerrit, so I could make a start on that
[13:03:04] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] es2053.yaml: Prepare es2053 for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1194644 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[13:03:06] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2053 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1194643 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[13:03:20] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:03:55] <Dreamy_Jazz>	 Can everyone in this window self deploy? May not need someone to handle it given that?
[13:04:49] <Dreamy_Jazz>	 I'm going to start on my one given it doesn't need zuul to be working
[13:04:51] <Daimona>	 I cannot
[13:04:55] <Lucas_WMDE>	 Dreamy_Jazz: go ahead
[13:05:12] <Dreamy_Jazz>	 I should be around for the entire window if we are able to do the patches that need zuul
[13:05:17] <edsanders>	 o/
[13:05:30] <Lucas_WMDE>	 arnaudb, hashar: is the Gerrit maintenance still ongoing? (wondering because of the backport+config window)
[13:05:42] <edsanders>	 I can self deploy - let me know when it's my turn
[13:06:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2053.codfw.wmnet with reason: Setting up new ES host
[13:08:26] <Lucas_WMDE>	 https://phabricator.wikimedia.org/T406762#11259697 sounds like there are problems with the Gerrit maintenance 😬
[13:08:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye
[13:08:51] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp2005.wikimedia.org with OS trixie
[13:08:51] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp2005.wikimedia.org
[13:11:07] <Dreamy_Jazz>	 Running scap now for the private code change
[13:12:44] <Lucas_WMDE>	 okay, slack says the gerrit switchover is reverted once more (due to the issue mentioned above)
[13:12:51] <Lucas_WMDE>	 (I assume it’ll be sent to wikitech-l in a moment)
[13:14:16] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:14:24] <Lucas_WMDE>	 looks like Zuul is still asleep
[13:14:54] <Lucas_WMDE>	 (someone™ could in theory run Daimona’s maintenance script but I assume it would be better to do that only *after* deploying the config change)
[13:15:07] <Dreamy_Jazz>	 Yeah, I think so
[13:15:14] * Lucas_WMDE in meeting
[13:15:42] <edsanders>	 Dreamy_Jazz: do I need to wait to start my config deployment?
[13:15:57] <Dreamy_Jazz>	 Yeah. Zuul isn't running, so it won't be able to be merged
[13:15:58] <Daimona>	 Shouldn't it be done before, since the config change removes the group? Or is it still possible to empty a group that no longer exists?
[13:16:13] <Dreamy_Jazz>	 Additionally I am currently running scap at the moment
[13:16:55] <Dreamy_Jazz>	 Re whether to run the script, maybe it should be run before but then the config change merged immediately afterwards?
[13:17:13] <Dreamy_Jazz>	 To minimise the time that some proportion of users lack access to the tool
[13:17:19] <hashar>	 I will take care of zuul
[13:17:21] <Lucas_WMDE>	 yeah, we should at least know that we’ll be able to run the config change soon ^^
[13:17:25] <Lucas_WMDE>	 thx hashar
[13:17:37] <Dreamy_Jazz>	 Thanks
[13:18:08] <hashar>	 Zuul used to be able to reconnect to Gerrit
[13:18:13] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863 (10FCeratto-WMF) 03NEW
[13:18:15] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:18:26] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930
[13:19:20] <hashar>	 Queue lengths at 0
[13:19:27] <hashar>	 there are 8 connections on gerrit originating from jenkins-bot
[13:19:51] <wikibugs>	 (03PS1) 10Arnaudb: Revert^4 "gerrit: Switchover gerrit1003 → gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1194931
[13:20:04] <wikibugs>	 (03PS1) 10Arnaudb: Revert^4 "gerrit: switchover from gerrit1003 to gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1194932
[13:20:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cloudceph: fix single-nic vlan interface specification [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478)
[13:20:14] <Daimona>	 Yeah that makes sense
[13:20:49] <Daimona>	 Or if we're still waiting, I can split the config change
[13:20:55] <hashar>	 !log Closed jenkins-bot connections on Gerrit primary
[13:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:14] <hashar>	 yeah it reconnected
[13:21:16] <Dreamy_Jazz>	 I'm done with my deploy, so the next config change should be ready to go once zuul is back
[13:21:32] <hashar>	 !log Zuul successfully reconnected to Gerrit
[13:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:48] <Dreamy_Jazz>	 edsanders: Do you have deployment rights and want to self deploy this?
[13:22:09] <edsanders>	 yeah - I can
[13:22:16] <edsanders>	 shall I start?
[13:22:19] <Dreamy_Jazz>	 Over to you then, yes
[13:22:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi)
[13:22:39] <Dreamy_Jazz>	 I can hang around for Daimona's changes if you just want to handle yours
[13:22:56] <Daimona>	 I'm going to split my patch
[13:23:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders)
[13:24:08] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Invalidate Flow cache on enwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194588 (owner: 10Esanders)
[13:24:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[13:24:26] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]]
[13:24:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "I'm live-iterating (!) on this atm and self-merging for expediency" [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi)
[13:24:51] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:24:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] cloudceph: fix single-nic vlan interface specification [puppet] - 10https://gerrit.wikimedia.org/r/1194933 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi)
[13:25:26] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Assign CampaignEvents user rights to autoconfirmed in small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445)
[13:25:26] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Delete the event-organizer user group on medium and small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445)
[13:25:27] <wikibugs>	 (03PS1) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934
[13:26:09] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good. I'll test this with sretest and will report back" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey)
[13:26:32] <Daimona>	 Done, and calendar updated
[13:28:24] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130)
[13:28:30] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:28:40] <Lucas_WMDE>	 my meeting is over, so I could also take over deployment if needed
[13:28:47] <Dreamy_Jazz>	 Thanks. I guess the order is grant autoconfirmed access, run the script, then undefine the group
[13:28:49] <Lucas_WMDE>	 (didn’t take as long as expected ^^)
[13:28:52] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[13:29:09] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130) (owner: 10Jforrester)
[13:29:14] <Dreamy_Jazz>	 Lucas_WMDE: If you could that would be great, as I want to go eat some lunch
[13:29:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[13:30:11] <Lucas_WMDE>	 sure!
[13:30:28] <icinga-wm>	 RECOVERY - dump of s1 in codfw on backupmon1001 is OK: Last dump for s1 at codfw (db2141) taken on 2025-10-09 11:44:02 (180 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:30:37] <Dreamy_Jazz>	 Thanks. See all of yous later \o
[13:30:51] <Daimona>	 Thank you! (Also in a meeting BTW)
[13:31:16] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-10-06-215412 to 2025-10-09-001812 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194936 (https://phabricator.wikimedia.org/T405130) (owner: 10Jforrester)
[13:31:38] <wikibugs>	 (03Abandoned) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (owner: 10Dzahn)
[13:32:00] <wikibugs>	 (03Restored) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (owner: 10Dzahn)
[13:32:06] <wikibugs>	 (03PS2) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (https://phabricator.wikimedia.org/T406774)
[13:32:24] <wikibugs>	 (03Abandoned) 10Dzahn: gerrit: increase QS_ClientPrefer threshold [puppet] - 10https://gerrit.wikimedia.org/r/1194702 (https://phabricator.wikimedia.org/T406774) (owner: 10Dzahn)
[13:32:50] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[13:32:55] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194588|Revert "Invalidate Flow cache on enwiktionary"]] (duration: 08m 29s)
[13:34:11] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[13:34:29] <Lucas_WMDE>	 alright, I’ll take over
[13:35:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[13:36:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[13:36:14] <wikibugs>	 (03Merged) 10jenkins-bot: Assign CampaignEvents user rights to autoconfirmed in small and medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194744 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[13:36:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]]
[13:36:36] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[13:36:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[13:37:11] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[13:37:59] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[13:41:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:41:32] <Lucas_WMDE>	 Daimona: please test the first half :)
[13:41:46] <Lucas_WMDE>	 (or the first third? because the maintenance script is also a step? whatever ^^)
[13:41:59] <wikibugs>	 (03PS1) 10Muehlenhoff: hcaptcha_proxy: Select the custom nginx provider instead of extras [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631)
[13:43:53] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough
[13:43:56] <Daimona>	 LGTM!
[13:44:10] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox
[13:44:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Continuing with sync
[13:44:16] <Lucas_WMDE>	 ok!
[13:44:19] <Lucas_WMDE>	 (me too)
[13:44:34] <wikibugs>	 (03PS1) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806)
[13:45:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bullseye
[13:45:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks, sounds good. https://wiki.debian.org/Nginx is not updated, so what's another good way of finding out what goes in all the differen" [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff)
[13:45:52] <wikibugs>	 (03PS2) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934
[13:48:15] <Lucas_WMDE>	 Daimona: for the maintenance script, what exactly should I run?
[13:48:22] <Lucas_WMDE>	 my feeling would be that we want it *with* --create-log
[13:48:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194744|Assign CampaignEvents user rights to autoconfirmed in small and medium wikis (T401445)]] (duration: 11m 51s)
[13:48:28] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[13:48:34] <Lucas_WMDE>	 and --log-reason='[[phabricator:T401445|T401445]]' or something like that
[13:48:39] <Daimona>	 I wasn't sure about the exact options because I've never done this before
[13:48:48] <Lucas_WMDE>	 I’m just looking at the help output on my localhost ^^
[13:48:48] <wikibugs>	 (03CR) 10Muehlenhoff: "The underlying idea is that everyone should only use the generic nginx binary and then install libnginx-mod-foo packages as they need them" [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff)
[13:48:51] <Daimona>	 I'm also not sure how many users will be affected
[13:48:52] * Lucas_WMDE searches SAL
[13:49:03] <Daimona>	 Maybe not too many, which is why we're changing this in the first place
[13:49:22] <Lucas_WMDE>	 yeah that would’ve been my guess
[13:49:34] <Lucas_WMDE>	 https://sal.toolforge.org/production?p=0&q=emptyUserGroup*&d= suggests the create log option might be pretty new
[13:50:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] hcaptcha_proxy: Select the custom nginx provider instead of extras [puppet] - 10https://gerrit.wikimedia.org/r/1194940 (https://phabricator.wikimedia.org/T405631) (owner: 10Muehlenhoff)
[13:50:03] <Lucas_WMDE>	 Dreamy_Jazz: if you’re still here – apparently you created (or backported) that option in May, but I don’t see a SAL entry using the option; were there any problems or was it just not !log’ed?
[13:50:12] * Lucas_WMDE looks at the task
[13:50:40] <Lucas_WMDE>	 ok https://phabricator.wikimedia.org/T393360#10867211 sounds like it worked fine
[13:51:33] <wikibugs>	 (03CR) 10Vgutierrez: "pretty cool job <3" [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis)
[13:51:34] <Daimona>	 Yeah, something pointing to the phab task would work I think
[13:51:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11260003 (10Jclark-ctr) I will work on this tomorrow since i am using an older drive i want to make sure it is wiped prior to installing.
[13:51:54] <wikibugs>	 (03CR) 10Cathal Mooney: "Patch seems sane.  Not terribly familiar with how this works but if it's already ok on the bookworm install servers makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[13:52:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[13:52:25] <Lucas_WMDE>	 hmm
[13:52:43] <Lucas_WMDE>	 does mwscript-k8s --dblist support expressions
[13:52:45] <Lucas_WMDE>	 --help says it does
[13:53:21] <Lucas_WMDE>	 so, I would run:
[13:53:22] <Lucas_WMDE>	 mwscript-k8s --comment=T401445 --sal --dblist=small+medium -- emptyUserGroup --create-log --log-reason='[[phabricator:T401445|T401445]]' event-organizer
[13:53:39] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[13:53:46] <Daimona>	 Seems correct
[13:53:52] <Lucas_WMDE>	 running
[13:54:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist small+medium emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer  # T401445
[13:54:01] <Daimona>	 Wish it had a dry-run but nvm
[13:54:03] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[13:54:17] <Lucas_WMDE>	 oh, oops, I forgot to --follow
[13:54:17] <Lucas_WMDE>	 meh
[13:54:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[13:54:54] <Daimona>	 Don't forget to --follow and smash that like button (?)
[13:55:05] <Lucas_WMDE>	 :trout:
[13:55:09] <Lucas_WMDE>	 it failed :(
[13:55:14] <Lucas_WMDE>	 can I figure out why though
[13:55:16] <wikibugs>	 (03Merged) 10jenkins-bot: Delete the event-organizer user group on medium and small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194935 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[13:55:27] <Lucas_WMDE>	 the output from `K8S_CLUSTER=codfw KUBECONFIG=/etc/kubernetes/mw-script-codfw.config kubectl logs -f job/mw-script.codfw.tbw487ih mediawiki-tbw487ih-app` is useless
[13:55:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]]
[13:55:34] <Lucas_WMDE>	 “emptyUserGroup: Running on small+medium” and that’s it
[13:55:43] <Daimona>	 That's not very informative
[13:55:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bullseye
[13:55:50] <Lucas_WMDE>	 I… guess I should abort that scap once it’s hit mwdebug
[13:56:03] <Lucas_WMDE>	 lemme just try running it on small and medium separately
[13:56:13] <Daimona>	 Yeah thought the same
[13:56:28] <Lucas_WMDE>	 I don’t actually know if + is a valid dblist operator, I just assumed
[13:56:32] <Lucas_WMDE>	 but I don’t use dblist maths often ^^
[13:56:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist small emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer  # T401445
[13:56:41] <Lucas_WMDE>	 yeah that looks much better
[13:56:42] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough
[13:56:43] <Lucas_WMDE>	 guess that was it then
[13:56:56] <Lucas_WMDE>	 a lot of “group was empty” so far
[13:56:58] <Lucas_WMDE>	 4 users in one wiki
[13:57:02] <Lucas_WMDE>	 I’ll export the output later
[13:57:07] <Lucas_WMDE>	 (could’ve tee’d it to a file, meh)
[13:57:40] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] site.pp: reimage all hcaptcha nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1194715 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[13:58:10] <Daimona>	 Yeah I don't think we're expecting many changes
[13:59:02] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833)
[13:59:02] <wikibugs>	 (03CR) 10Arnaudb: "simple step to speed up next switchover" [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:59:35] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: local backup on source server only [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833)
[13:59:40] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1050.eqiad.wmnet
[14:00:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:00:05] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[14:00:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: foreachwikiindblist medium emptyUserGroup --create-log '--log-reason=[[phabricator:T401445|T401445]]' event-organizer  # T401445
[14:01:01] <Lucas_WMDE>	 let’s let the medium run finish before syncing
[14:01:13] <Lucas_WMDE>	 but Daimona can you test the second change in the meantime?
[14:01:27] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1001.wikimedia.org with OS bookworm
[14:01:40] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2001.wikimedia.org with OS bookworm
[14:02:17] <Lucas_WMDE>	 !log for the record, the `foreachwikiindblist small+medium emptyUserGroup` maintenance script run (for T401445) did *not* work, running the maintenance script separately for small and medium worked better
[14:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:47] <Lucas_WMDE>	 (I don’t like the idea that someone might pull the broken command out of the SAL without realizing it, hence this log ^^)
[14:03:31] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha1002.wikimedia.org with OS bookworm
[14:03:35] <Daimona>	 Well but they'd need to read that entry before copypasting the command :D
[14:03:42] <Daimona>	 Anyway, testing the second part now
[14:03:44] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha2002.wikimedia.org with OS bookworm
[14:03:50] <Lucas_WMDE>	 yeah, that’s why I tried to include some of the same terms someone might’ve been searching for :P
[14:03:52] <Lucas_WMDE>	 thanks
[14:04:59] <Daimona>	 User group removal looks good too
[14:05:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync
[14:05:33] <Lucas_WMDE>	 ok!
[14:05:39] <wikibugs>	 (03CR) 10CDanis: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis)
[14:05:44] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1050.eqiad.wmnet
[14:08:17] <sukhe>	 !log restart pybal on lvs1020 to pick up WDQS changes
[14:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:31] <hnowlan>	 jouncebot: nowandnext
[14:09:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[14:09:32] <jouncebot>	 In 0 hour(s) and 20 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430)
[14:09:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:10:19] <logmsgbot>	 elukey@cumin1003 provision (PID 1990297) is awaiting input
[14:10:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194935|Delete the event-organizer user group on medium and small wikis (T401445)]] (duration: 14m 47s)
[14:10:23] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[14:12:51] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:12:52] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage
[14:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:15] <Daimona>	 Thanks Lucas!
[14:13:21] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:13:23] <Lucas_WMDE>	 np :)
[14:13:49] <wikibugs>	 (03CR) 10Ladsgroup: Avoid using wikitech dblist in configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup)
[14:15:14] <Daimona>	 Wait 
[14:15:22] <Daimona>	 Did I just screw up and the group didn't need to be emptied?
[14:16:52] <wikibugs>	 (03CR) 10Ladsgroup: Avoid using wikitech dblist in configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup)
[14:17:56] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage
[14:18:28] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage
[14:18:54] <icinga-wm>	 RECOVERY - dump of s3 in codfw on backupmon1001 is OK: Last dump for s3 at codfw (db2239) taken on 2025-10-09 11:44:02 (127 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[14:19:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1001.wikimedia.org with reason: host reimage
[14:20:05] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway)
[14:20:29] <wikibugs>	 (03CR) 10Hashar: [C:04-1] "-1 to ack the remarks mentioned by Tacsipacsi. I'll take them in account and amend, but not today :]" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar)
[14:20:30] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194597 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert)
[14:21:41] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2001.wikimedia.org with reason: host reimage
[14:21:51] <wikibugs>	 (03PS4) 10Scott French: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955)
[14:23:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage
[14:23:25] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[14:23:34] <wikibugs>	 (03PS4) 10Scott French: rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955)
[14:24:54] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:22] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:26:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-main_443: Servers wdqs1026.eqiad.wmnet are marked down but pooled: wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:27:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:28:26] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha1002.wikimedia.org with reason: host reimage
[14:28:42] <wikibugs>	 (03PS1) 10Bking: wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473)
[14:29:12] <hnowlan>	 !log rest.php group2-except-enwiki on rest-gateway at 10% 
[14:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:52] <wikibugs>	 (03PS2) 10CDanis: wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking)
[14:29:54] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430)
[14:31:31] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha2002.wikimedia.org with reason: host reimage
[14:34:24] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:35:03] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:35:31] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1001.wikimedia.org with OS bookworm
[14:36:28] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet']
[14:36:45] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2046.codfw.wmnet']
[14:36:48] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2046.codfw.wmnet']
[14:37:01] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp2046.codfw.wmnet']
[14:37:22] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2001.wikimedia.org with OS bookworm
[14:39:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11260203 (10Jclark-ctr) 05Open→03Resolved
[14:39:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS bullseye
[14:39:57] <wikibugs>	 (03PS1) 10SBassett: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501)
[14:42:26] <icinga-wm>	 RECOVERY - Host ms-be2078 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[14:42:35] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:42:47] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox
[14:43:00] <wikibugs>	 (03PS1) 10D3r1ck01: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808)
[14:43:18] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] installserver: Drop support for legacy atftpd startup [puppet] - 10https://gerrit.wikimedia.org/r/1194915 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[14:43:19] <wikibugs>	 (03PS1) 10D3r1ck01: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808)
[14:44:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking)
[14:44:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wdqs-internal-main,scholarly: Update health check for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1194956 (https://phabricator.wikimedia.org/T193473) (owner: 10Bking)
[14:44:09] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha1002.wikimedia.org with OS bookworm
[14:44:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[14:44:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[14:46:10] <wikibugs>	 (03PS2) 10SBassett: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501)
[14:47:20] <sukhe>	 !log restart pybal on lvs1020
[14:47:20] <wikibugs>	 (03PS3) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066)
[14:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:35] <wikibugs>	 (03CR) 10SBassett: OATHAuth Recovery Code code improvement (031 comment) [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett)
[14:47:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz)
[14:47:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:48:34] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha2002.wikimedia.org with OS bookworm
[14:49:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett)
[14:49:31] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey)
[14:49:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cloudceph: handle double -> single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478)
[14:49:39] <wikibugs>	 (03PS7) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646
[14:50:42] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318)
[14:53:37] <wikibugs>	 (03PS1) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969
[14:53:43] <hnowlan>	 jouncebot: nowandnext
[14:53:43] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1430)
[14:53:43] <jouncebot>	 In 0 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1500)
[14:55:18] <wikibugs>	 (03CR) 10Majavah: [C:04-1] cloudceph: handle double -> single NIC transition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi)
[14:58:08] <icinga-wm>	 RECOVERY - dump of s4 in codfw on backupmon1001 is OK: Last dump for s4 at codfw (db2239) taken on 2025-10-09 13:20:35 (238 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:00:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "I've confirmed with a reimage that it fixes passing a different installer environment." [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey)
[15:00:05] <jouncebot>	 jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1500)
[15:00:21] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1194598 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert)
[15:01:16] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318)
[15:02:54] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11260369 (10Jhancock.wm) @MatthewVernon ms-be2083 and ms-be2084 controllers have been swapped out.
[15:05:12] <wikibugs>	 (03CR) 10JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (owner: 10Elukey)
[15:07:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:09:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:35] <wikibugs>	 (03CR) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (owner: 10Elukey)
[15:15:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1194599 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert)
[15:16:29] <wikibugs>	 (03PS1) 10KartikMistry: Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854)
[15:18:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: allow the usage of the pxe_media arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1194930 (owner: 10Elukey)
[15:21:21] <kart_>	 I'll deploy recommendation API. Minor change.
[15:21:37] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry)
[15:22:52] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: rest-gateway routes for rest.php: enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194609 (https://phabricator.wikimedia.org/T406318)
[15:23:41] <wikibugs>	 (03Merged) 10jenkins-bot: Update Recommendation API to 2025-10-09-145754-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194971 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry)
[15:23:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:25:16] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[15:26:47] <sukhe>	 !log sukhe@lvs1019:~$ sudo systemctl restart pybal.service
[15:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:56] <wikibugs>	 (03PS1) 10Mszwarc: arbcom_plwiki: Change favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883)
[15:31:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194975 (https://phabricator.wikimedia.org/T406883) (owner: 10Mszwarc)
[15:31:47] <wikibugs>	 (03CR) 10Vgutierrez: WIP: ja4h lua first draft, & concat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (owner: 10CDanis)
[15:32:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:34:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:50] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[15:37:04] <wikibugs>	 (03PS3) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934
[15:39:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:40:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] conftool-data: add hcaptcha[12]00[12].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1194722 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[15:42:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Create boot environment of Bullseye with a 6.1 kernel - https://phabricator.wikimedia.org/T405102#11260770 (10MoritzMuehlenhoff) I built a bullseye d-i environment with the Linux 6.1 from Debian LTS (https://tracker.debian.org/pkg/linux-6.1) and after some needed fi...
[15:45:56] <wikibugs>	 (03PS1) 10Reedy: DNM (yet): Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978
[15:46:52] <wikibugs>	 (03PS1) 10Federico Ceratto: site.pp: Add es2052 [puppet] - 10https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859)
[15:48:27] <Daimona>	 Hey folks! Earlier today, I asked for a user group to be emptied on several wikis, which I later realized I shouldn't have done. I would like to revert these changes and re-add the affected users to the group (list in https://phabricator.wikimedia.org/P83722). Could I get another pair of eyes on this and run the script shortly?
[15:48:38] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=proxoid,name=hcatpcha.* [reason: setting weight for proxoid hcaptcha dedicated VM]
[15:48:38] <Daimona>	 ("the script" = createAndPromote)
[15:48:53] <wikibugs>	 (03PS4) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934
[15:48:59] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=proxoid,name=hcaptcha.* [reason: setting weight for proxoid hcaptcha dedicated VM]
[15:51:36] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981
[15:51:44] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:52:02] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445)
[15:52:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863#11260824 (10VRiley-WMF) a:03VRiley-WMF
[15:52:32] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: rest-gateway routes for rest.php: enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1194609 (https://phabricator.wikimedia.org/T406318) (owner: 10Clément Goubert)
[15:52:44] <wikibugs>	 (03PS1) 10Ssingh: conftool-data: proxoid: remove urldownloader machines [puppet] - 10https://gerrit.wikimedia.org/r/1194982 (https://phabricator.wikimedia.org/T405631)
[15:55:36] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892
[15:56:06] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:57:36] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] conftool-data: proxoid: remove urldownloader machines [puppet] - 10https://gerrit.wikimedia.org/r/1194982 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[15:57:47] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:58:50] <wikibugs>	 (03PS3) 10Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892
[15:59:37] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:59:57] <wikibugs>	 (03PS2) 10Reedy: DNM (yet): Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978
[16:00:05] <jouncebot>	 jhathaway and moritzm: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:17] <wikibugs>	 (03PS1) 10Ssingh: Revert "conftool-data: proxoid: remove urldownloader machines" [puppet] - 10https://gerrit.wikimedia.org/r/1194984
[16:00:29] <wikibugs>	 (03CR) 10Ssingh: "emergency revert if required, do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/1194984 (owner: 10Ssingh)
[16:02:07] <tgr_>	 jhathaway: do you mind if I deploy a MediaWiki backport during the puppet window?
[16:02:08] <icinga-wm>	 PROBLEM - Host ms-be2078 is DOWN: PING CRITICAL - Packet loss = 100%
[16:02:45] <tgr_>	 maybe two if Daimona still needs the config change to be reverted
[16:03:12] <Daimona>	 I have some more coming so I might add them to the later window
[16:03:17] <cdanis>	 tgr_: I don't think there's anything planned for the puppet window
[16:03:51] <tgr_>	 ok, thanks, I'll go along then
[16:04:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1194892 (owner: 10Elukey)
[16:05:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[16:05:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[16:06:29] <wikibugs>	 (03CR) 10Ssingh: [C:04-2] Revert "conftool-data: proxoid: remove urldownloader machines" [puppet] - 10https://gerrit.wikimedia.org/r/1194984 (owner: 10Ssingh)
[16:06:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11260904 (10BTullis) 05Resolved→03Open a:05Jclark-ctr→03BTullis Reopening and assigning to myself, because there is a manual op to do here. I hope...
[16:07:53] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445)
[16:08:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[16:09:57] <wikibugs>	 (03Merged) 10jenkins-bot: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1194963 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[16:10:04] <wikibugs>	 (03Merged) 10jenkins-bot: session: Improve logging for MultiBackendSessionStore [core] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194964 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[16:10:27] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]]
[16:10:32] <wikibugs>	 (03PS1) 10Cathal Mooney: inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984)
[16:10:36] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[16:10:36] <stashbot>	 T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633
[16:10:37] <stashbot>	 T405634: Authenticated data should not be in the anonymous store - https://phabricator.wikimedia.org/T405634
[16:11:27] <wikibugs>	 (03CR) 10Muehlenhoff: "(PCC failure for P5 is expected)" [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[16:13:31] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976)
[16:14:09] <logmsgbot>	 !log tgr@deploy2002 tgr, d3r1ck01: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:14:16] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:14:54] <sukhe>	 huh
[16:14:54] <sukhe>	 hmm
[16:15:05] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990
[16:15:24] <sukhe>	 ah from the earlier WDQS one
[16:15:24] <sukhe>	 ok
[16:15:38] <sukhe>	 fixing
[16:17:20] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976)
[16:17:25] <wikibugs>	 (03CR) 10CDanis: [C:03+1] haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez)
[16:17:49] <wikibugs>	 (03CR) 10CDanis: [C:03+1] haproxy: Set and propagate X-Request-ID [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez)
[16:18:14] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:18:16] <logmsgbot>	 elukey@cumin1003 provision (PID 2011513) is awaiting input
[16:18:22] <sukhe>	 !log sukhe@lvs2013:~$ sudo systemctl restart pybal.service
[16:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:18:46] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2078.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[16:19:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11260967 (10Jhancock.wm) Power(W) = (Voltage(V) * Amperage(A) * sqrt{3} * PowerFactor(PF)) V = 208 A = 30 sqrt{3} = 1.732 PF = .8 (80% of power to allow for fluctuations)  thus 208 * 30 *...
[16:21:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11260969 (10Jhancock.wm)
[16:21:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:21:48] <sukhe>	 always something eh lol
[16:21:51] <vgutierrez>	 :)
[16:23:33] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990 (owner: 10BryanDavis)
[16:24:10] <wikibugs>	 (03PS5) 10CDanis: WIP: ja4h lua first draft, & concat [puppet] - 10https://gerrit.wikimedia.org/r/1194934
[16:24:16] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:25:46] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-10-07-043158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1194990 (owner: 10BryanDavis)
[16:26:11] <wikibugs>	 (03CR) 10CDanis: [C:03+1] inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney)
[16:26:24] <logmsgbot>	 !log tgr@deploy2002 tgr, d3r1ck01: Continuing with sync
[16:27:31] <wikibugs>	 (03PS4) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066)
[16:27:31] <wikibugs>	 (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1194994 (https://phabricator.wikimedia.org/T385066)
[16:27:33] <wikibugs>	 (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1194995 (https://phabricator.wikimedia.org/T385066)
[16:27:35] <wikibugs>	 (03PS1) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1194996 (https://phabricator.wikimedia.org/T385066)
[16:28:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11261019 (10Jhancock.wm) @elukey license uploaded for cp2056. should be good to try that one again.
[16:30:34] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194963|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]], [[gerrit:1194964|session: Improve logging for MultiBackendSessionStore (T402808 T405633 T405634)]] (duration: 20m 07s)
[16:30:41] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[16:30:41] <stashbot>	 T405633: Session data is authenticated, should not be an anonymous user - https://phabricator.wikimedia.org/T405633
[16:30:42] <stashbot>	 T405634: Authenticated data should not be in the anonymous store - https://phabricator.wikimedia.org/T405634
[16:30:46] <tgr_>	 done, thanks
[16:33:10] <cwhite>	 !log upgrade grafana-loki on grafana hosts T406478
[16:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:33] <stashbot>	 T406478: Scap logs on Grafana dashboards are broken - https://phabricator.wikimedia.org/T406478
[16:33:36] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445)
[16:35:09] <wikibugs>	 (03PS3) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978
[16:35:17] <Reedy>	 jouncebot: nowandnext
[16:35:17] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1600)
[16:35:17] <jouncebot>	 In 0 hour(s) and 24 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700)
[16:35:17] <jouncebot>	 In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700)
[16:36:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy)
[16:38:45] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[16:40:10] <wikibugs>	 (03CR) 10Ottomata: "yes! TY!" [puppet] - 10https://gerrit.wikimedia.org/r/1194989 (https://phabricator.wikimedia.org/T221976) (owner: 10Vgutierrez)
[16:44:33] <logmsgbot>	 cmooney@cumin1003 netbox (PID 2014825) is awaiting input
[16:44:54] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:47:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for inter.link transit IPs in drmrs - cmooney@cumin1003"
[16:50:29] <logmsgbot>	 cmooney@cumin1003 netbox (PID 2014825) is awaiting input
[16:55:33] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] atftpd: Drop service definition [puppet] - 10https://gerrit.wikimedia.org/r/1194917 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[16:56:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:57:39] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for inter.link transit IPs in drmrs - cmooney@cumin1003"
[16:57:39] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:00:05] <jouncebot>	 bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700). nyaa~
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T1700)
[17:00:39] <bd808>	 o/ I have a developer-portal build to push out today.
[17:02:38] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:04:05] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:05:15] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:06:04] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:06:30] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:07:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:32] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:22:06] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo)
[17:22:14] <wikibugs>	 (03PS6) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946)
[17:30:42] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1020.eqiad.wmnet with reason: downtime lvs1020 to suppress alerts about enp94s0f0np0 going down and losing backend connectivity
[17:30:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11261370 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=366dc32d-e9d7-437c-98a7-7a9cd7979655) set by cmooney@cumin...
[17:31:35] <topranks>	 !log begin work to move lvs1020 uplink cable from ssw1-f1-eqiad to ssw1-e1-eqiad 
[17:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:06] <wikibugs>	 (03PS1) 10Ssingh: url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631)
[17:33:49] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7242/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[17:33:51] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:36:32] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:04-2] "The Phase 3 rollout of hCaptcha is on Wed Sep 15. I think we should not merge this until Sep 16 so that in case required, we can revert ba" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh)
[17:38:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/33 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:39:02] <sukhe>	 ^ Cathal is working on this so expected
[17:40:03] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo)
[17:43:25] <topranks>	 sukhe: yeah sorry dc-ops too quick for me, I'll clear that now gimme another few 
[17:43:47] <sukhe>	 topranks: no worries at all on our end. take your time!
[17:44:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 988180768 and 67 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:47:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 27200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:48:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11261445 (10ssingh) 05Open→03Resolved a:03ssingh Rolled out.
[17:50:39] <wikibugs>	 (03PS21) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246)
[17:52:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[17:54:02] <wikibugs>	 (03PS22) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246)
[17:54:51] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406911 (10phaultfinder) 03NEW
[17:57:48] <wikibugs>	 (03PS23) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246)
[18:01:47] <rzl>	 rolling out some envoy upgrades in staging
[18:02:26] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/apertium: apply
[18:02:41] <logmsgbot>	 !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/apertium: apply
[18:02:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:03:13] <rzl>	 suspiciously timed but not related
[18:03:46] <jynus>	 drmrs
[18:03:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[18:03:56] <jhathaway>	 o/
[18:04:06] <jhathaway>	 !incidents
[18:04:06] <sirenbot>	 6853 (ACKED)  [2x] ProbeDown sre (text-https:443 probes/service drmrs)
[18:04:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[18:04:13] <jynus>	 nel going up
[18:04:47] <fabfur>	 we had a spike in 4XX 
[18:06:09] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445)
[18:06:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[18:07:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:09:54] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:10:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 892324392 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:11:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 79736 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:18:51] <jinxer-wm>	 RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/33 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[18:24:54] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:27:02] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:30:42] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), No backups: 5 (gerrit1003, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[18:31:37] <wikibugs>	 (03PS1) 10Jforrester: i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021
[18:35:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11261706 (10cmooney) 05Open→03Resolved Link has been moved, port is up on ssw1-e1-eqiad and MACs learnt on all vlans: ` cmooney...
[18:35:40] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet
[18:35:40] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet
[18:36:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261714 (10RobH) So I'm going to outline a few assumptions here and steal this back.  If any of the following assumptions are incorrect, please let me know.   We now have a move ahead...
[18:36:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261715 (10RobH) a:05BCornwall→03RobH
[18:39:59] <James_F>	 jouncebot: nowandnext
[18:39:59] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 20 minute(s)
[18:39:59] <jouncebot>	 In 1 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000)
[18:40:13] <James_F>	 OK, I'll get an i18n change out, as it'll take too much time for a window.
[18:40:29] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester)
[18:40:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11261725 (10RobH)
[18:43:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester)
[18:44:40] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Thu 06 Nov 2025 06:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[18:52:45] <wikibugs>	 (03Merged) 10jenkins-bot: i18n: Pull forward wikimedia-boardelection2025-notification-body updates [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester)
[18:53:04] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]]
[18:58:52] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:59:48] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[19:04:43] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195021|i18n: Pull forward wikimedia-boardelection2025-notification-body updates]] (duration: 11m 39s)
[19:12:07] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM! I haven't tested sending message to private channels but it must be possible as the app includes the necessary scopes to do so. I su" [puppet] - 10https://gerrit.wikimedia.org/r/1194736 (owner: 10Cwhite)
[19:51:44] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[19:51:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 81103992 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:52:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 104240 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:59:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1018.eqiad.wmnet
[19:59:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1019.eqiad.wmnet
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000).
[20:00:05] <jouncebot>	 sbassett, Reedy, and Daimona: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:24] <Daimona>	 o/
[20:00:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs1020.eqiad.wmnet
[20:00:32] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-worker2003: return to production role [puppet] - 10https://gerrit.wikimedia.org/r/1194305 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking)
[20:00:41] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging in the interest of time." [puppet] - 10https://gerrit.wikimedia.org/r/1194305 (https://phabricator.wikimedia.org/T399778) (owner: 10Bking)
[20:01:04] <sbassett>	 o/
[20:01:17] <sbassett>	 Just FYI - the order of my patches matters, and currently it’s reverse-listed :)
[20:02:47] <Reedy>	 Who is deploying? :P
[20:02:49] <Reedy>	 I can..
[20:02:58] <wikibugs>	 (03PS1) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041
[20:03:10] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis)
[20:03:23] <sbassett>	 Well, Roan is def OOO.
[20:03:31] <Reedy>	 Daimona: i18n changes? :P
[20:03:47] <Daimona>	 Yeeeeah
[20:03:49] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy)
[20:03:58] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[20:03:58] <Daimona>	 You have been warned :D
[20:04:08] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[20:04:29] <wikibugs>	 (03CR) 10Reedy: [C:03+2] OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett)
[20:04:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[20:04:42] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194978 (owner: 10Reedy)
[20:04:47] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Delete the event-organizer user group on medium and small wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194981 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[20:04:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), and 2 others: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11261952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host dse-k8s-worker2003.cod...
[20:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: Assign campaignevents-generate-invitation-lists right explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1194986 (https://phabricator.wikimedia.org/T401445) (owner: 10Daimona Eaytoy)
[20:05:29] <Reedy>	 Shove those three out while others merge
[20:06:31] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]]
[20:06:34] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[20:07:16] <Daimona>	 I can take care of running createAndPromote, but I'm wondering: is there a way to delete log entries without manually messing with the DB? Since I'm basically reverting a group membership change, I thought it'd be nice if we could delete all log entries showing that this ever happened
[20:07:48] <Daimona>	 I'm not sure if I want to make manual changes though
[20:08:50] <Reedy>	 I think it ends up being a specific-purpose maintenance script that gets written
[20:09:34] <Reedy>	 Daimona: do you care much about testing those two?
[20:09:48] <Daimona>	 I can do a quick test
[20:10:57] <logmsgbot>	 !log reedy@deploy2002 daimona, reedy: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:11:10] <wikibugs>	 (03Merged) 10jenkins-bot: OATHAuth Recovery Code code improvement [extensions/OATHAuth] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1194962 (https://phabricator.wikimedia.org/T406501) (owner: 10SBassett)
[20:12:57] <Daimona>	 Looks good
[20:13:03] <logmsgbot>	 !log reedy@deploy2002 daimona, reedy: Continuing with sync
[20:13:14] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] sshd: use the default KexAlgorithms algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1194734 (owner: 10JHathaway)
[20:15:01] <Daimona>	 Re deleting logs, deleting by primary key seems generic enough. I just don't know if there's anything somewhere that might reference those rows
[20:15:22] <Reedy>	 Are there many?
[20:15:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] cyberbot: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193129 (owner: 10Dzahn)
[20:17:10] <wikibugs>	 (03CR) 10Dzahn: gerrit: local backup on source server only (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194949 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[20:17:17] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194978|Update interwiki cache]], [[gerrit:1194981|Revert "Delete the event-organizer user group on medium and small wikis" (T401445)]], [[gerrit:1194986|Assign campaignevents-generate-invitation-lists right explicitly (T401445)]] (duration: 10m 46s)
[20:17:32] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[20:17:32] <Reedy>	 that was a bit laggy
[20:18:03] <Daimona>	 100 log entries already there. Presumably 100 more once we revert the group membership change
[20:18:30] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]]
[20:18:33] <stashbot>	 T406501: OATHAuth Recovery Code code improvement suggestions - https://phabricator.wikimedia.org/T406501
[20:19:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage
[20:19:45] <sbassett>	 Huh, weird unserialize() spike due to some template on strategywiki...
[20:20:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: mod_qos revert to previous stable state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb)
[20:21:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: mod_qos revert to previous stable state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb)
[20:22:47] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: mod_qos revert to previous stable state [puppet] - 10https://gerrit.wikimedia.org/r/1194811 (https://phabricator.wikimedia.org/T406403) (owner: 10Arnaudb)
[20:22:59] <logmsgbot>	 !log reedy@deploy2002 sbassett, reedy: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:23:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2003.codfw.wmnet with reason: host reimage
[20:24:13] <Reedy>	 sbassett: Want to test much of this one?
[20:24:13] <sbassett>	 I know you know this, Reedy, but the OATH patch is a no-op until we deploy the config changes
[20:24:28] <logmsgbot>	 !log reedy@deploy2002 sbassett, reedy: Continuing with sync
[20:24:30] <sbassett>	 Ha, see ^
[20:25:50] <mutante>	 !log re-enabling QoS on gerrit servers - with previously stable config - T406774
[20:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:19] <mutante>	 misses one of the bots that updates a ticket
[20:28:49] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1194962|OATHAuth Recovery Code code improvement (T406501)]] (duration: 10m 19s)
[20:28:54] <stashbot>	 T406501: OATHAuth Recovery Code code improvement suggestions - https://phabricator.wikimedia.org/T406501
[20:29:37] <mutante>	 !log re-enabled QoS on gerrit servers - with previously stable config - T406774  gerrit:1194811
[20:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:53] <mutante>	 !log logmsgbot do you still log - test log T284123
[20:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:57] <stashbot>	 T284123: tcpircbot-logmsgbot was not able to deliver messages - https://phabricator.wikimedia.org/T284123
[20:33:43] <wikibugs>	 (03PS2) 10Reedy: Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett)
[20:33:51] <sbassett>	 here we go :)
[20:34:17] <wikibugs>	 (03CR) 10SBassett: [C:03+1] Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett)
[20:34:18] <Reedy>	 sbassett: You'll have to remove your -2 ;)
[20:34:25] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett)
[20:34:32] <sbassett>	 …and done
[20:35:17] <wikibugs>	 (03Merged) 10jenkins-bot: Enable New UI and Multiple Module support for OATHAuth in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193928 (https://phabricator.wikimedia.org/T399644) (owner: 10SBassett)
[20:37:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 312192288 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:37:55] <logmsgbot>	 !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]]
[20:38:02] <stashbot>	 T399644: FY2025-26 WE4.6.2 Multiple Authenticators - https://phabricator.wikimedia.org/T399644
[20:38:20] <Daimona>	 So for the createAndPromote. I need to run a script on multiple wikis with different arguments for each invocation. Surely there must be a better way than invoking mwscript-k8s 100 times as the documentation says NOT to do?
[20:38:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 202776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:39:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations: nodesource node22 apt mirror is broken - https://phabricator.wikimedia.org/T406623#11262222 (10Dzahn)
[20:39:02] <Daimona>	 (The lame 100-invocation way is https://phabricator.wikimedia.org/P83722#336349)
[20:39:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927 (10Lars) 03NEW
[20:40:28] <Reedy>	 Daimona: Probably not
[20:41:38] <Daimona>	 Ah well
[20:41:49] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] "Great! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1194736 (owner: 10Cwhite)
[20:42:04] <logmsgbot>	 !log reedy@deploy2002 reedy, sbassett: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:42:05] <Daimona>	 Any SREs who could confirm and give a green light to run https://phabricator.wikimedia.org/P83722#336349 then?
[20:42:24] <sbassett>	 OATH config patch has landed on the k8s-mwdebugs.  So far looking good!
[20:42:45] <Reedy>	 Daimona: Just do it tbh... Unless you write a script that does createandpromote for N users per wiki...
[20:42:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm
[20:42:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11262268 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host dse-k8s-worker2003.c...
[20:43:37] <Daimona>	 Yeah I just wanna make sure I don't bring everything down.
[20:44:15] <Reedy>	 It's more that it's just not very efficient spinning up the workers for short-lived jobs like this
[20:44:21] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753)
[20:44:58] <Daimona>	 Yeah... Also, I'm prepared for this to take ages
[20:45:06] <Reedy>	 It won't take that long tbh
[20:45:14] <Reedy>	 But if you do them in smaller batches...
[20:46:17] <Daimona>	 !log Run createAndPromote as in P83722#336349 (~100x, in series) to restore event-organizer membership # T401445
[20:46:17] <Reedy>	 sbassett: The fonts on Special:AccountSecurity look a bit weird
[20:46:31] <Reedy>	 But I suspect that may be a bit of RL module weirdness
[20:46:34] <Daimona>	 Well, we shall see
[20:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:00] <stashbot>	 T401445: Update Event Registration (organizer side) to be available for all autoconfirmed users - Thursday, Oct 9 - https://phabricator.wikimedia.org/T401445
[20:47:15] <sbassett>	 Reedy: Really? They seem ok in Chrome/MacOS for me? Or at least not noticeably different than mw-docker, patchdemo, beta...
[20:47:24] <Reedy>	 check slack
[20:48:01] <Reedy>	 It's not something I'm massively worried about though
[20:48:13] <Reedy>	 >Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content.
[20:48:16] <Reedy>	 That may be an actual bug
[20:48:50] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:49:12] <wikibugs>	 ops-eqiad, DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929 (phaultfinder) NEW
[20:49:27] <Reedy>	 I see a new recovery code has been DB persisted alongside my TOTP
[20:49:32] <sbassett>	 Reedy: so when you reload it looks ok or…?
[20:50:03] <wikibugs>	 (PS7) LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999)
[20:50:07] <sbassett>	 Hooray, that’s working.  I’ve tested several add/remove key, use new key workflows at this point.  I’m not seeing any backend issues in Chrome/MacOS...
[20:50:33] <Reedy>	 Nope
[20:50:40] <Reedy>	 I bet that's a timeless skin issue
[20:50:40] <sbassett>	 I’ve been able to use recovery codes and have them persist when a new one gets created as well.
[20:50:42] <Reedy>	 I'll file a bug
[20:50:52] <wikibugs>	 (PS1) Jdlrobson: Enable instrumentation of watchstar and other links that stopPropagation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390)
[20:51:02] <sbassett>	 Yeah, has to be something at the skin or codex layer.  If it looked a lot worse, I might hold off, but...
[20:51:50] <sbassett>	 The UI in Vector seems as intended/expected for me.
[20:51:54] <Reedy>	 Happy to continue?
[20:52:16] <sbassett>	 I am, if we think minor color/font issues in some skins aren’t that big of a deal and/or can be addressed in the near future.
[20:53:20] <sbassett>	 Personally, I’m far more worried about any backend/workflow issues, and I’m not seeing any for now.  But that’s just MO.
[20:53:34] <Reedy>	 If it was broken on vector/vector22/minerva, I'd be less happy to continue
[20:53:41] <sbassett>	 Yes, agreed.
[20:53:49] <Reedy>	 But that's also what it was more tested on, soo...
[20:53:56] <logmsgbot>	 !log reedy@deploy2002 reedy, sbassett: Continuing with sync
[20:54:00] <Reedy>	 jouncebot: nowandnext
[20:54:00] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2000)
[20:54:00] <jouncebot>	 In 0 hour(s) and 5 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2100)
[20:54:20] <wikibugs>	 (CR) Reedy: [C:+2] Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: Daimona Eaytoy)
[20:57:45] <tzatziki>	 Uhoh is votewiki down? https://vote.wikimedia.org/wiki/Main_Page
[20:57:57] <Reedy>	 tzatziki: Down how?
[20:57:59] <logmsgbot>	 !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193928|Enable New UI and Multiple Module support for OATHAuth in Wikimedia production (T399644)]] (duration: 20m 04s)
[20:58:03] <stashbot>	 T399644: FY2025-26 WE4.6.2 Multiple Authenticators - https://phabricator.wikimedia.org/T399644
[20:58:03] <tzatziki>	 Original exception: [15fe86b3-3935-44ca-94fa-ef8e107a2e62] 2025-10-09 20:57:49: Fatal exception of type "TypeError"
[20:58:24] <Reedy>	 >TypeError: MediaWiki\Extension\OATHAuth\Key\TOTPKey::__construct(): Argument #3 ($recoveryCodes) must be of type array, string given, called in /srv/mediawiki/php-1.45.0-wmf.22/extensions/OATHAuth/src/Key/TOTPKey.php on line 125
[20:58:40] <sbassett>	 Ugh
[20:58:42] <Reedy>	 w
[20:58:47] <Reedy>	 Why is that on votewiki
[20:58:50] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:58:54] <tzatziki>	 oh, this is a regression as of like two minutes ago?
[20:58:58] <tzatziki>	 nice
[20:59:12] <tzatziki>	 shall I file a bug?
[20:59:19] <sbassett>	 An issue with non-SUL wikis?
[20:59:40] <Reedy>	 Kinda looks like it
[20:59:58] <sbassett>	 I don’t have a votewiki account, I don’t think.
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251009T2100)
[21:00:15] <tzatziki>	 https://phabricator.wikimedia.org/T406933 
[21:00:19] <Reedy>	 tzatziki: I think it may "only" be an issue for you logged in people...
[21:00:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11262360 (Ahoelzl)
[21:00:21] <Reedy>	 https://phabricator.wikimedia.org/T406932
[21:00:40] <tzatziki>	 oh, lol, snap. I'll merge my dupe
[21:01:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11262364 (Ahoelzl) @RLazarus sorry for the direct ping, who could help with this?
[21:01:09] * Jdlrobson prepares for  Web Team deployment window
[21:01:15] <sbassett>	 officewiki seems to work alright...
[21:01:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11262370 (Ahoelzl)
[21:01:26] <wikibugs>	 (PS24) Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246)
[21:01:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11262372 (Ahoelzl) @RLazarus sorry for the direct ping, who could help with this?
[21:01:36] <tzatziki>	 Collabwiki also works, ftr
[21:01:39] <Reedy>	 Obviously it depends on traffic, but I'm only seeing this on vote
[21:01:52] <Reedy>	 oh fucking lol
[21:01:52] <wikibugs>	 (PS8) LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999)
[21:01:53] <Reedy>	 I see why
[21:01:59] <Reedy>	 No one ever ran a prior maintenance script
[21:02:00] <tzatziki>	 :D 
[21:02:04] <taavi>	 whoops
[21:02:11] <Reedy>	 so there's recovery codes that are strings with commas
[21:02:15] <tzatziki>	 poor urchin votewiki
[21:02:21] <wikibugs>	 (CR) LorenMora: Add ReadingList Stream to EventStreamConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: LorenMora)
[21:02:24] <Daimona>	 My script is ~40 done
[21:02:32] <taavi>	 Reedy: that's, uh, from how many years ago?
[21:02:43] <Reedy>	 taavi: Did we delete that one for string to array for recovery codes?
[21:02:46] <sbassett>	 Ok, so not something I broke?
[21:02:50] <taavi>	 I have a feeling we might have
[21:02:52] <wikibugs>	 (CR) CI reject: [V:-1] opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[21:02:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:02:55] <Reedy>	 sbassett: yes and no
[21:02:55] <taavi>	 let's see
[21:03:09] <Reedy>	 You've coded for the latest behaviour
[21:03:15] <Reedy>	 Probably removing a back compat thing along the way
[21:03:21] <sbassett>	 Oh fun
[21:03:27] <Reedy>	 This affects ~6 users
[21:03:41] <sbassett>	 Ok, so no rollback.
[21:03:42] <tzatziki>	 yeah can you just disable their 2fa and I'll contact them
[21:03:43] <taavi>	 yeah, we dropped the script in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OATHAuth/+/1189123
[21:03:44] * Reedy looks at tzatziki
[21:03:52] <tzatziki>	 if that's a solution
[21:03:53] <Reedy>	 https://github.com/wikimedia/mediawiki-extensions-OATHAuth/blob/REL1_43/maintenance/UpdateTOTPScratchTokensToArray.php
[21:03:56] <Reedy>	 It's in 1.43
[21:03:58] <tzatziki>	 yes I'm one of them :(  lol
[21:04:09] <Reedy>	 Give me a few mins
[21:04:12] <taavi>	 I'm guessing we have this in other non-votewiki private wikis?
[21:04:14] <sbassett>	 tx, Reedy
[21:04:21] <wikibugs>	 (PS1) Dzahn: zuul: add host network to docker command for new zuul-web component [puppet] - https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119)
[21:04:37] <wikibugs>	 (Merged) jenkins-bot: Add separate user right for invitation lists [extensions/CampaignEvents] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1195019 (https://phabricator.wikimedia.org/T401445) (owner: Daimona Eaytoy)
[21:04:42] <Reedy>	 I suspect I should run it on other wikis too...
[21:04:47] <wikibugs>	 (CR) CI reject: [V:-1] zuul: add host network to docker command for new zuul-web component [puppet] - https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[21:05:18] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: add host network to docker command for new zuul-web component [puppet] - https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[21:05:26] <wikibugs>	 (PS2) Dzahn: zuul: add host network to docker command for new zuul-web component [puppet] - https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119)
[21:05:47] <Reedy>	 And of course the script is still targeting oathauth_users :D
[21:05:49] <Reedy>	 easily fixed
[21:05:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390) (owner: Jdlrobson)
[21:07:57] <Reedy>	 tzatziki: How about now?
[21:08:00] <sbassett>	 Seeing 8 TypeErrors in logstash for this issue rn, most of which I’m guessing are from the people in this chat :)
[21:08:11] <Reedy>	 ah, no, it's not fixed
[21:08:16] <tzatziki>	 Reedy: --
[21:08:18] <tzatziki>	 well. :D 
[21:08:31] <tzatziki>	 Yes, I've probably refreshed 8 times. lol
[21:08:43] <tzatziki>	 It's by no means urgent btw. I just needed to see the voter list
[21:08:45] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: add host network to docker command for new zuul-web component [puppet] - https://gerrit.wikimedia.org/r/1195053 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[21:08:50] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:12] <sbassett>	 Oof, seeing some more action in logstash for votewiki, which I assume is related to Reedy trying to fix things.  Worst-case is maybe we temp disable the new UI/module support for votewiki or private wikis?
[21:11:35] <Reedy>	 tzatziki: Third time lucky
[21:11:47] <tzatziki>	 nope :(  
[21:11:47] <tzatziki>	 Original exception: [64c245a2-e0c4-4223-9fa2-58555714ec99] 2025-10-09 21:11:40: Fatal exception of type "TypeError"
[21:11:49] <Reedy>	 or...
[21:13:06] <wikibugs>	 (Merged) jenkins-bot: Enable instrumentation of watchstar and other links that stopPropagation [extensions/WikimediaEvents] (wmf/1.45.0-wmf.22) - https://gerrit.wikimedia.org/r/1195049 (https://phabricator.wikimedia.org/T406390) (owner: Jdlrobson)
[21:14:00] <Reedy>	 tzatziki: ok
[21:14:03] <Reedy>	 *now* it is fixed
[21:14:12] <tzatziki>	 Reedy: wahooo
[21:14:17] <tzatziki>	 indeed it works now!
[21:14:18] <sbassett>	 Hooray!
[21:15:04] <sbassett>	 So we need to run that maint script for a few more private wikis, most likely, then?
[21:15:39] <Reedy>	 anything non CA
[21:15:55] <Reedy>	 Daimona: Want your other patch deploying now? :P
[21:17:06] <Reedy>	 done for private
[21:17:10] <Daimona>	 Yup pls
[21:17:47] <Daimona>	 Script is about 85% done
[21:17:49] <Reedy>	 Done for fishbowl
[21:18:00] <Reedy>	 We don't have any other non SUL groups do we...
[21:18:13] <sbassett>	 Shouldn’t be many
[21:18:28] <Reedy>	 ah, Jdlrobson is using their window :P
[21:18:53] <Reedy>	 https://noc.wikimedia.org/conf/highlight.php?file=dblists/sul.dbexpr
[21:18:58] <Reedy>	 >all.dblist + preinstall.dblist - fishbowl.dblist - private.dblis
[21:19:22] <Jdlrobson>	 ahh Reedy is that why I'm seeing "There were unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.45.0-wmf.22 Continue with deployment (all patches will be deployed)? [y/N]:" ?
[21:19:52] <Jdlrobson>	 I'm not 100% sure what I should be doing now
[21:20:12] <Jdlrobson>	 I assume no is the correct answer?
[21:20:24] <Reedy>	 Are you deploying more than one patch (in series)?
[21:20:27] <Jdlrobson>	 just one patch
[21:20:33] <Reedy>	 I didn't start running mine till ~14 mins after the lock
[21:20:42] <Jdlrobson>	 and it wasn't working in my testing so I'm not sure what's happened
[21:20:49] <Reedy>	 And scap hasn't checked it out
[21:21:17] <Jdlrobson>	 I am using https://spiderpig.wikimedia.org/ if you want to take a look
[21:21:21] <Reedy>	 Jdlrobson: I'm pretty sure you can just continue, it's just kinda saying "hey, there's some other stuff that you might have meant to deploy"
[21:22:28] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]]
[21:22:39] <Reedy>	 oh... dancy about?
[21:22:45] <Reedy>	 and/or thcipriani...
[21:22:54] <Reedy>	 Does spiderpig just try and deploy all the things outstanding?
[21:23:00] <stashbot>	 T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390
[21:23:25] <wikibugs>	 (PS1) Dzahn: zuul: add missing $host_ip variable to zuul-web class [puppet] - https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119)
[21:24:14] <wikibugs>	 (PS2) Dzahn: zuul: add missing $host_ip variable to zuul-web class [puppet] - https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119)
[21:24:30] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: add missing $host_ip variable to zuul-web class [puppet] - https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[21:24:45] <dancy>	 That's right. If you answer yes, it deploys whatever is merged 
[21:24:58] <logmsgbot>	 marostegui@cumin1003 clone_es (PID 1943562) is awaiting input
[21:25:00] <Jdlrobson>	 Reedy: it looks like some CampaignEvents changes?
[21:25:18] <TimStarling>	 !log on db2202 cleaned up the tables I created for T400696
[21:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:23] <stashbot>	 T400696: FY25-26 WE1.4.1 RecentChanges database performance improvements - https://phabricator.wikimedia.org/T400696
[21:25:26] <Reedy>	 Yeah, I'd +2'd https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1195019 and then we had a 2FA issue on officewiki
[21:25:27] <Daimona>	 My run of createAndPromote is done
[21:25:33] <Reedy>	 *votewiki
[21:27:36] <Jdlrobson>	 Reedy: okay so I guess that's now being deployed? Was that tested or do we need to abort this?
[21:28:30] <Jdlrobson>	 I see I have options "Interrupt job" and "Kill job (not recommended)" but this is beyond my spiderpig training. I also need to step out in the next 10 mins for a doctor visit
[21:29:12] <Reedy>	 I'd hope the master patch was tested :)
[21:30:09] <wikibugs>	 (CR) Dzahn: [C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1195062 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[21:32:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:34:02] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:34:06] <wikibugs>	 (PS1) Ryan Kemper: wdqs: bring internal-scholarly wdqs2017 into svc [puppet] - https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978)
[21:35:04] <wikibugs>	 (CR) Bking: [C:+1] wdqs: bring internal-scholarly wdqs2017 into svc [puppet] - https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978) (owner: Ryan Kemper)
[21:35:09] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: bring internal-scholarly wdqs2017 into svc [puppet] - https://gerrit.wikimedia.org/r/1195063 (https://phabricator.wikimedia.org/T405978) (owner: Ryan Kemper)
[21:41:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863#11262521 (VRiley-WMF) Open→Resolved This has been fixed. Loose cable.
[21:41:37] <wikibugs>	 (PS5) Scott French: api-gateway: Add support for PHP_ENGINE cookie routing [deployment-charts] - https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955)
[21:41:37] <wikibugs>	 (PS5) Scott French: rest-gateway: Divert PHP_ENGINE=8.3 requests to mw-api-ext-next [deployment-charts] - https://gerrit.wikimedia.org/r/1194791 (https://phabricator.wikimedia.org/T405955)
[21:43:26] <Jdlrobson>	 dancy: it's taking a lot longer than usual so it's making me a little nervous. I need to go to a doctor's appointment so can I pass to you if it overruns significantly?
[21:43:44] <Reedy>	 it's taking longer because of i18n updates
[21:44:01] <Reedy>	 it's got past the docker image builds now
[21:44:07] <Jdlrobson>	 yep which I hadn't planned for :)
[21:44:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929#11262530 (VRiley-WMF) a:VRiley-WMF
[21:44:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Power Supply - Status - issue on dse-k8s-worker1003:9290 - https://phabricator.wikimedia.org/T406929#11262531 (VRiley-WMF) Open→Resolved This is resolved. Loose power cable
[21:47:05] <Jdlrobson>	 ok looks like it's almost done. I should be able to hang around and still make my appointment
[21:47:54] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:47:57] <stashbot>	 T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390
[21:49:48] <Daimona>	 Yeah sorry, unfortunately I don't think it was possible to do this without the i18n updates
[21:50:10] <Jdlrobson>	 okay changes are on test servers. Mine are working
[21:50:19] <Jdlrobson>	 @Daimona can you test?
[21:50:35] <Jdlrobson>	 Apparently they only synced to test servers
[21:51:24] <Daimona>	 Yep, appears to be working correctly, thank you!
[21:51:27] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[21:51:29] <Jdlrobson>	 ok syncing
[21:57:43] <wikibugs>	 (PS25) Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246)
[21:57:53] <wikibugs>	 (CR) Scott French: "Many thanks to you both for the review, as well as for the tip about the rest-gateway development setup. The latter helped identify two is" [deployment-charts] - https://gerrit.wikimedia.org/r/1194790 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[21:59:38] <Jdlrobson>	 ok i really have to go now
[21:59:47] <Jdlrobson>	 could someone keep an eye on spiderpig for me while I'm gone?
[22:00:00] <Jdlrobson>	 Reedy: ?
[22:00:07] <Reedy>	 I'm watching, yeah
[22:00:10] <Reedy>	 Sorry about this!
[22:00:25] <Jdlrobson>	 thank you and appreciate you! gotta run
[22:00:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1018.eqiad.wmnet
[22:02:43] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[22:04:06] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1195049|Enable instrumentation of watchstar and other links that stopPropagation (T406390)]] (duration: 41m 38s)
[22:04:10] <stashbot>	 T406390: Pull data on watchlist star usage - https://phabricator.wikimedia.org/T406390
[22:04:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1020.eqiad.wmnet
[22:05:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs1019.eqiad.wmnet
[22:07:43] <jinxer-wm>	 FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[22:08:50] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:09:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:09:35] <Daimona>	 Thanks Jon!
[22:11:22] <inflatador>	 !log bking@wdqs10(18|19|20) systemctl start load-categories-daily.service T405978
[22:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:26] <stashbot>	 T405978: Re-image remaining full graph hosts to post-graph-split roles - https://phabricator.wikimedia.org/T405978
[22:11:32] <wikibugs>	 (PS1) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:12:40] <wikibugs>	 (PS2) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:13:48] <wikibugs>	 (PS3) Dzahn: httpbb: add minimal tests for new zuul-web [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119)
[22:13:50] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-categories.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:13:55] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:22:38] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1195067/7244/deploy1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:24:02] <jinxer-wm>	 FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:28:50] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:28:50] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:28] <Daimona>	 reedy: any idea how much longer it might take?
[22:29:39] <Reedy>	 Daimona: How much longer for what?
[22:29:54] <Daimona>	 Isn't the deployment still running?
[22:29:59] <Reedy>	 It finished 25 mins ago
[22:30:15] <Daimona>	 ... You can tell it hasn't been a good day for me lol
[22:30:23] <wikibugs>	 (PS1) Dzahn: httpbb: add missing directory for new zuul tests [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119)
[22:31:02] <wikibugs>	 (CR) Dzahn: [C:+2] httpbb: add missing directory for new zuul tests [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:31:20] <Daimona>	 (In my defense, the SAL entry doesn't link to my patch)
[22:32:32] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195073" [puppet] - https://gerrit.wikimedia.org/r/1195067 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:33:13] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[22:33:37] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1195073 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[22:34:51] <mutante>	 leeeeroy jenkins
[23:01:37] <logmsgbot>	 marostegui@cumin1003 clone_es (PID 1943886) is awaiting input
[23:10:33] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2017.*
[23:13:50] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:18:50] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:38:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079
[23:38:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079 (owner: TrainBranchBot)
[23:51:38] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1195079 (owner: TrainBranchBot)