[00:15:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:15:44] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:15:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:21:38] (PS3) RLazarus: abstractwiki-rust: Fetch cargo-chef from its own vendored-sources repo [docker-images/production-images] - https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) (owner: Jforrester)
[00:22:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:24:51] (CR) RLazarus: [V:+2 C:+2] "Built locally."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) (owner: Jforrester)
[00:39:34] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:09:21] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.6 [core] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1298950 (https://phabricator.wikimedia.org/T423915)
[01:09:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1298951
[01:09:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1298951 (owner: TrainBranchBot)
[01:09:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.6 [core] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1298950 (https://phabricator.wikimedia.org/T423915) (owner: TrainBranchBot)
[01:14:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:54] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1298951 (owner: TrainBranchBot)
[01:22:00] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.6 [core] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1298950 (https://phabricator.wikimedia.org/T423915) (owner: TrainBranchBot)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0200)
[02:01:40] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:08:19] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 38s)
[02:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:32] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[02:20:26] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0300)
[03:01:59] (PS1) TrainBranchBot: testwikis to 1.47.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1298956 (https://phabricator.wikimedia.org/T423915)
[03:02:01] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1298956 (https://phabricator.wikimedia.org/T423915) (owner: TrainBranchBot)
[03:03:03]
(Merged) jenkins-bot: testwikis to 1.47.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1298956 (https://phabricator.wikimedia.org/T423915) (owner: TrainBranchBot)
[03:03:28] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.6 refs T423915
[03:03:33] T423915: 1.47.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T423915
[03:11:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:14:44] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:14:55] sre-alert-triage, ServiceOps new: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133#11997246 (RLazarus) Chatted with @Scott_French about this today. The cause is Sophroid's [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/a2f4c186...
[03:15:42] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:23:47] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1298746 (owner: L10n-bot)
[03:35:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[03:35:44] Deployment linkrecommendation-internal in linkrecommendation at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[03:35:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[03:40:44] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.6 refs T423915 (duration: 37m 16s)
[03:40:48] T423915: 1.47.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T423915
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0400)
[04:01:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:02:51] !log mwpresync@deploy1003 Pruned MediaWiki: 1.47.0-wmf.3 (duration: 02m 43s)
[04:06:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:39:34] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:45:28] PROBLEM - Backup freshness on
backup1014 is CRITICAL: All failures: 2 (phab2003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:55:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:01:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:06:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:14:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x1 T428158
[05:18:38] T428158: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T428158
[05:19:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1220 with weight 0 T428158', diff saved to https://phabricator.wikimedia.org/P93932 and previous config saved to /var/cache/conftool/dbconfig/20260609-051859-marostegui.json
[05:19:27] (PS2) Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1297683
(https://phabricator.wikimedia.org/T428158)
[05:21:36] (CR) Marostegui: [C:+2] mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1297683 (https://phabricator.wikimedia.org/T428158) (owner: Gerrit maintenance bot)
[05:22:12] !log Starting x1 eqiad failover from db1237 to db1220 - T428158
[05:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set x1 eqiad as read-only for maintenance - T428158', diff saved to https://phabricator.wikimedia.org/P93933 and previous config saved to /var/cache/conftool/dbconfig/20260609-052253-marostegui.json
[05:23:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1220 to x1 primary and set section read-write T428158', diff saved to https://phabricator.wikimedia.org/P93934 and previous config saved to /var/cache/conftool/dbconfig/20260609-052311-marostegui.json
[05:23:38] (CR) Marostegui: [C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1297684 (https://phabricator.wikimedia.org/T428158) (owner: Gerrit maintenance bot)
[05:23:42] !log marostegui@dns1004 START - running authdns-update
[05:24:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1237 T428158', diff saved to https://phabricator.wikimedia.org/P93935 and previous config saved to /var/cache/conftool/dbconfig/20260609-052420-marostegui.json
[05:24:25] T428158: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T428158
[05:26:22] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:27:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:27:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1237: Upgrading db1237.eqiad.wmnet
[05:27:32] !log marostegui@cumin1003 END (PASS) - Cookbook
sre.mysql.depool (exit_code=0) depool db1237: Upgrading db1237.eqiad.wmnet
[05:28:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - other - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:30:32] marostegui@cumin1003 major-upgrade (PID 2077501) is awaiting input
[05:37:37] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:39] marostegui@cumin1003 major-upgrade (PID 2077501) is awaiting input
[05:37:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local
zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41]
PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 6567d2f16ad6d9a4357b06ece55850c779d4882e, dns.git is 006a520a9bd9f3b230461b6b7a45ce6fec8663e2) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[05:37:50] !log marostegui@dns1004 START - running authdns-update
[05:40:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS trixie
[05:43:57] FIRING: [3x] ProbeDown: Service text-https:443 has
failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:57] RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0600)
[06:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0600).
[06:01:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 40 hosts with reason: Primary switchover s4 T426086
[06:01:08] T426086: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T426086
[06:01:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db1244 with weight 0 T426086', diff saved to https://phabricator.wikimedia.org/P93937 and previous config saved to /var/cache/conftool/dbconfig/20260609-060121-fceratto.json
[06:09:50] (CR) Federico Ceratto: [C:+2] mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1286409 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[06:10:41] !log Starting s4 eqiad failover from db1160 to db1244 - T426086
[06:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:45] T426086: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T426086
[06:11:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s4
eqiad as read-only for maintenance - T426086', diff saved to https://phabricator.wikimedia.org/P93938 and previous config saved to /var/cache/conftool/dbconfig/20260609-061131-fceratto.json
[06:12:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db1244 to s4 primary and set section read-write T426086', diff saved to https://phabricator.wikimedia.org/P93939 and previous config saved to /var/cache/conftool/dbconfig/20260609-061222-fceratto.json
[06:14:43] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[06:15:12] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[06:15:13] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[06:15:26] (CR) Federico Ceratto: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[06:15:27] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:15:43] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[06:16:53] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[06:16:56] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[06:16:57] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[06:17:00] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[06:19:49] (PS2) Federico Ceratto: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[06:20:12] (CR) Federico Ceratto: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[06:20:39] (CR) CI reject: [V:-1] wmnet:
Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[06:20:41] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:21:34] FIRING: DiskSpace: Disk space ganeti1039:9100:/ 3.908% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ganeti1039 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:24:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool db1160 T426086', diff saved to https://phabricator.wikimedia.org/P93940 and previous config saved to /var/cache/conftool/dbconfig/20260609-062412-fceratto.json
[06:24:18] T426086: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T426086
[06:28:10] ops-eqiad, DBA, DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542 (Marostegui) NEW
[06:30:50] SRE, DNS, Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997424 (FCeratto-WMF) This is also preventing https://gerrit.wikimedia.org/r/c/operations/dns/+/1286411 from being merged
[06:35:02] SRE, DNS, Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997431 (Marostegui) p:Triage→High
[06:36:13] ops-eqiad, SRE, DBA, DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11997432 (Marostegui) p:Triage→Medium
[06:36:34] RESOLVED: DiskSpace: Disk space ganeti1039:9100:/ 2.352% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ganeti1039 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:36:37] (Abandoned) Arnaudb: tcpproxy: add support for gitlab-ssh [puppet] - https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[06:37:08] (Abandoned) Arnaudb: conftool-data: add tcp-proxy gitlab service [puppet] - https://gerrit.wikimedia.org/r/1290729 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[06:37:47] (PS1) Muehlenhoff: Update access medadata lwilson-ctr [puppet] - https://gerrit.wikimedia.org/r/1299298
[06:39:12] (PS9) Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog [puppet] - https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441)
[06:39:41] (PS9) Arnaudb: lvs7003: add gitlab-ssh and gitlab-https [puppet] - https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441)
[06:40:16] ops-eqiad, SRE, DBA, DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11997441 (Jclark-ctr) a:Jclark-ctr
[06:42:17] (PS20) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007)
[06:42:23] (CR) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[06:56:32] (CR) Arnaudb: [C:+1] gerrit: ensure error_log.json, sshd_log.json are always shipped to ELK [puppet] - https://gerrit.wikimedia.org/r/1298931 (https://phabricator.wikimedia.org/T425667) (owner: Dzahn)
[06:58:15] (CR) Jelto: gitlab: add gitlab-ssh.wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) (owner: Arnaudb)
[06:58:20] SRE, DNS, Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997473 (AlexisJazz) I'm seeing "Failed to fetch
notifications." when I try to view my notifications on enwiktionary, is this related? I'm also seeing ``https://en.wiktionary.org/api/rest_v1/page/html/non_sequitur...
[07:00:05] Amir1, urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:49] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1237.eqiad.wmnet with OS trixie
[07:00:49] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[07:02:31] (CR) Mszwarc: [C:+1] Add 2FA enforcement demotion config for phase 3 groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1298890 (https://phabricator.wikimedia.org/T423120) (owner: Alex.sanford)
[07:02:43] (PS1) Muehlenhoff: Pick a new canary [puppet] - https://gerrit.wikimedia.org/r/1299313
[07:03:24] (CR) Fabfur: [C:+1] Rewrite VarnishHighThreadCount to trigger less [alerts] - https://gerrit.wikimedia.org/r/1298909 (owner: BCornwall)
[07:06:44] Amir1: is it okay for you to deploy "Enable wgNewUserMessageOnFirstEdit on commonswiki" now? or would you rather wait a few more days to see if there are any problems?
[07:07:41] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11997498 (Krd) p:Medium→Unbreak! As said, daily-article-l is not manageable for months because the interface...
[07:08:05] SRE, DNS, Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997504 (elukey) The error is: ` netbox/codfw.wmnet:539 dse-k8s-wdqs2002.codfw.wmnet. A 10.192.47.10 netbox/codfw.wmnet:540 dse-k8s-wdqs2002.codfw.wmnet. AAAA 2620:0:860:125:10:192:46:9 netbox/codfw.wm...
[07:08:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[07:10:22] SRE, DNS, Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997506 (elukey) I think the DNS name for https://netbox.wikimedia.org/ipam/ip-addresses/23579/ was 2002 instead of 2001, fixed it.
[07:11:13] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1160: Repooling
[07:11:21] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1160: Repooling
[07:11:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1160: Repooling
[07:11:32] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1160: Repooling
[07:12:24] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11997510 (ABran-WMF) >>! In T353891#11997498, @Krd wrote: > As said, daily-article-l is not manageable for months be...
[07:12:43] !log elukey@cumin1003 START - Cookbook sre.dns.netbox
[07:13:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[07:16:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[07:16:12] (CR) Slyngshede: [C:+1] Update access medadata lwilson-ctr [puppet] - https://gerrit.wikimedia.org/r/1299298 (owner: Muehlenhoff)
[07:17:02] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11997522 (Krd) The link loads when not logged in. As soon as you log in, I assume it loads the user list of 30k entr...
[07:17:35] (CR) Slyngshede: [C:+1] admin: Add apdube to analytics-private-datausers [puppet] - https://gerrit.wikimedia.org/r/1298924 (https://phabricator.wikimedia.org/T427553) (owner: RLazarus)
[07:19:24] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix dse-k8s-wdqs2002 duplicate ipv6 address - elukey@cumin1003"
[07:19:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix dse-k8s-wdqs2002 duplicate ipv6 address - elukey@cumin1003"
[07:19:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:20:48] (PS1) Slyngshede: data.yaml: add mkrolik-wmf as ldap-user [puppet] - https://gerrit.wikimedia.org/r/1299327
[07:20:49] (CR) Elukey: "recheck" [dns] - https://gerrit.wikimedia.org/r/1286411 (https://phabricator.wikimedia.org/T426086) (owner: Gerrit maintenance bot)
[07:21:17] (CR) Brouberol: [C:+1] "Nicely done!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [07:21:17] !log marostegui@dns1004 START - running authdns-update [07:21:49] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:21:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:22:10] (03PS1) 10Slyngshede: Update to CAS 7.3.7.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1299328 [07:22:15] (03CR) 10Brouberol: [C:03+2] CI: add aux-k8s-codfw to the list of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298283 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:22:37] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:39] 
RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:42] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and 
operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [07:22:47] !log marostegui@dns1004 END - running authdns-update [07:22:49] oh yesss [07:23:19] (03CR) 10Brouberol: [C:03+2] aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:23:52] (03CR) 10Brouberol: [C:03+2] aux-k8s: define the kafka-ui namespace in both clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298266 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:24:02] (03CR) 10Brouberol: [C:03+2] aux-k8s: define the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298267 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:24:11] 06SRE, 10DNS, 06Traffic: authdns-update failing - https://phabricator.wikimedia.org/T428541#11997541 (10elukey) 05Open→03Resolved a:03elukey Confirmed, all good! The fix was https://netbox.wikimedia.org/extras/changelog/280023/ @AlexisJazz it shouldn't be related, it was just a minor DNS update th... [07:24:44] !log fceratto@dns1004 START - running authdns-update [07:25:41] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [07:25:43] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [07:26:19] !log fceratto@dns1004 END - running authdns-update [07:28:44] jouncebot: now [07:28:44] For the next 0 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T0700) [07:33:50] (03CR) 10Arthur taylor: "Looks good - the properties all look correct and the config should work. Just not sure where those translations are coming from." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [07:34:37] zabe: o/ if you are around: I noticed a ton of mediawiki errors related to a delete batch job for enwikinews [07:36:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:36:02] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:36:36] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:36:38] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.major-upgrade (exit_code=97) [07:37:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS trixie [07:37:54] !log brouberol@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [07:38:56] !log brouberol@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[07:39:13] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-ui: apply [07:41:04] (03CR) 10Muehlenhoff: [C:03+2] Update access medadata lwilson-ctr [puppet] - 10https://gerrit.wikimedia.org/r/1299298 (owner: 10Muehlenhoff) [07:41:30] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-ui: apply [07:42:46] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-ui: apply [07:43:32] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-ui: apply [07:43:44] !log brouberol@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-ui: apply [07:44:04] !log brouberol@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-ui: apply [07:45:47] (03CR) 10Brouberol: [C:03+2] dse-k8s-aux: migrate internal kafka-ui disc and svc records to k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1298262 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:45:59] (03PS1) 10Muehlenhoff: Update access medadata for dani [puppet] - 10https://gerrit.wikimedia.org/r/1299334 [07:46:05] !log brouberol@dns1004 START - running authdns-update [07:46:11] (03CR) 10Arthur taylor: [C:03+1] "looks good! Can be deployed when the announcement has been made and enough time has passed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [07:47:40] !log brouberol@dns1004 END - running authdns-update [07:52:13] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:52:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1299328 (owner: 10Slyngshede) [07:53:17] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update to CAS 7.3.7.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1299328 (owner: 10Slyngshede) [07:54:50] (03CR) 10Jelto: "thank you for updating the image !" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298837 (https://phabricator.wikimedia.org/T321316) (owner: 10Urbanecm) [07:59:04] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix typoes - ayounsi@cumin1003" [07:59:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix typoes - ayounsi@cumin1003" [07:59:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:06:58] jouncebot: now [08:06:58] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [08:07:28] folks I will be pooling the docker registry to eqiad and depool it from codfw [08:07:35] hopefully, it will not hurt [08:08:05] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=docker-registry,name=codfw [08:08:15] !log jiji@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=docker-registry,name=eqiad [08:10:08] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11997726 (10ABran-WMF) this is the weird thing, I tried to log in and refresh the page and still get the same 
snappy r... [08:13:20] (03CR) 10Effie Mouzeli: [C:03+2] docker_registry: replace rdb2009 with rdb2013 [puppet] - 10https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [08:14:07] (03PS2) 10Effie Mouzeli: docker_registry: replace rdb2009 with rdb2013 [puppet] - 10https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) [08:14:27] (03PS1) 10Muehlenhoff: Record LDAP access for nayoub [puppet] - 10https://gerrit.wikimedia.org/r/1299412 [08:15:26] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:29] (03CR) 10Effie Mouzeli: [C:03+2] docker_registry: replace rdb2009 with rdb2013 [puppet] - 10https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [08:16:05] (03CR) 10Harroyo-wmf: [C:03+1] wmf-config: Enable hCaptcha on UploadWizard publish for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [08:18:27] (03PS1) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 [08:18:39] (03Abandoned) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1297534 (owner: 10Ayounsi) [08:18:44] (03Abandoned) 10Ayounsi: webrequest_sampled_live: add "kind: number" when relevant [puppet] - 10https://gerrit.wikimedia.org/r/1297539 (owner: 10Ayounsi) [08:20:07] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for nayoub [puppet] - 10https://gerrit.wikimedia.org/r/1299412 (owner: 10Muehlenhoff) [08:21:42] (03PS1) 10Slyngshede: IDP: Upgrade to CAS 7.3.7.2 [dns] - 10https://gerrit.wikimedia.org/r/1299416 [08:21:45] 
(03Abandoned) 10Matthias Mullie: Add exception for main page [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298726 (https://phabricator.wikimedia.org/T421019) (owner: 10Marco Fossati) [08:22:10] !log jiji@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=docker-registry,name=eqiad [08:22:17] !log jiji@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=docker-registry,name=codfw [08:22:42] (03PS2) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 [08:26:16] (03PS3) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 [08:32:29] (03PS1) 10Dreamy Jazz: alertmanager: Reroute TSP alerts to PSI alerts channel [puppet] - 10https://gerrit.wikimedia.org/r/1299420 [08:39:34] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:22] (03CR) 10Hashar: "What we want for the service:" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [08:44:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1299327 (owner: 10Slyngshede) [08:45:26] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:26] (03CR) 10Kosta Harlan: [C:03+1] alertmanager: Reroute TSP alerts to PSI alerts channel [puppet] - 10https://gerrit.wikimedia.org/r/1299420 (owner: 10Dreamy Jazz) [08:45:43] (03PS1) 
10JavierMonton: stream: mediawiki.user_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) [08:50:23] !log jiji@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=registry2004.codfw.wmnet [08:50:42] jouncebot: NotASpy [08:50:45] jouncebot: now [08:50:45] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [08:50:58] if anyone is about to deploy on k8s, please hold on [08:55:01] !log jiji@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=registry2004.codfw.wmnet [08:55:11] !log jiji@cumin1003 conftool action : set/pooled=no; selector: service=docker-registry,name=registry2005.codfw.wmnet [08:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:17] (03PS9) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [08:57:43] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1237.eqiad.wmnet with OS trixie [08:58:24] (03PS1) 10Ayounsi: wmf_netflow: specify data kind (bool/number) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299424 [08:59:06] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:59:36] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[09:00:12] (03CR) 10Btullis: [C:03+2] Switch from 4 wdqs namespaces to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298307 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [09:00:43] (03CR) 10Slyngshede: [C:03+1] Update access medadata for dani [puppet] - 10https://gerrit.wikimedia.org/r/1299334 (owner: 10Muehlenhoff) [09:01:27] (03CR) 10Ayounsi: "Not sure what it will really change, but it seems cleaner that way." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299424 (owner: 10Ayounsi) [09:02:50] (03PS2) 10Slyngshede: data.yaml: add mkrolik-wmf as ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/1299327 [09:02:58] (03PS14) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [09:03:11] !log jiji@cumin1003 conftool action : set/pooled=yes; selector: service=docker-registry,name=registry2005.codfw.wmnet [09:03:21] it is sorted [09:04:46] (03CR) 10Muehlenhoff: [C:03+2] Update access medadata for dani [puppet] - 10https://gerrit.wikimedia.org/r/1299334 (owner: 10Muehlenhoff) [09:05:02] (03CR) 10Cathal Mooney: [C:03+1] "I'm not terribly familiar with this syntax but the logic of all the changes looks good to me. No obvious errors." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299424 (owner: 10Ayounsi) [09:06:36] (03Abandoned) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 (owner: 10Cathal Mooney) [09:06:38] (03CR) 10Muehlenhoff: data.yaml: offboarding hmonroy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1298446 (owner: 10Slyngshede) [09:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:07:24] (03CR) 10JMeybohm: [C:04-1] admin_ng/dse-k8s: create opensearch ClusterIssuer (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [09:07:35] (03CR) 10Cathal Mooney: [C:03+2] network data.yaml: add new per-rack vlan ranges for eqiad ab refresh [puppet] - 10https://gerrit.wikimedia.org/r/1297685 (https://phabricator.wikimedia.org/T418012) (owner: 10Cathal Mooney) [09:08:21] (03Merged) 10jenkins-bot: Switch from 4 wdqs namespaces to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298307 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [09:10:20] (03CR) 10Slyngshede: [C:03+2] data.yaml: add mkrolik-wmf as ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/1299327 (owner: 10Slyngshede) [09:11:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1043 to es4 eqiad primary T428386', diff saved to https://phabricator.wikimedia.org/P93942 and previous config saved to /var/cache/conftool/dbconfig/20260609-091147-marostegui.json [09:11:53] T428386: Migrate es4 section to Debian Trixie - https://phabricator.wikimedia.org/T428386 [09:11:58] (03PS1) 10Kosta Harlan: hcaptcha: Allow the wikisource.org bare domain in 
frame-ancestors CSP [puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) [09:11:59] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:12:15] (03PS2) 10JavierMonton: stream: mediawiki.user_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) [09:12:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1042 to es4 eqiad primary T428386', diff saved to https://phabricator.wikimedia.org/P93943 and previous config saved to /var/cache/conftool/dbconfig/20260609-091215-marostegui.json [09:12:58] (03PS2) 10Slyngshede: data.yaml: offboarding hmonroy [puppet] - 10https://gerrit.wikimedia.org/r/1298446 [09:13:14] (03PS1) 10Marostegui: db1237: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1299428 (https://phabricator.wikimedia.org/T428542) [09:14:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:14:17] (03CR) 10Slyngshede: data.yaml: offboarding hmonroy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1298446 (owner: 10Slyngshede) [09:14:27] (03CR) 10Marostegui: [C:03+2] db1237: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1299428 (https://phabricator.wikimedia.org/T428542) (owner: 10Marostegui) [09:14:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1043: Upgrading es1043.eqiad.wmnet [09:14:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:14] (03CR) 10Effie Mouzeli: [C:03+2] aliases: swap rdb2007 with rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1294270 (owner: 10Effie Mouzeli) [09:15:57] (03PS1) 
10Muehlenhoff: Update access medadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 [09:16:19] (03CR) 10CI reject: [V:04-1] Update access medadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 (owner: 10Muehlenhoff) [09:16:47] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:17:15] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [09:17:20] !log fceratto@cumin1003 MariaDB change: Setting sections s5 as read-write [09:17:26] (03PS1) 10Brouberol: global_config: inject network devices data for re-use in Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1299426 (https://phabricator.wikimedia.org/T428553) [09:17:28] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.global-read-only (exit_code=0) [09:17:41] (03PS1) 10Effie Mouzeli: aliases: update all rdb* entries with the new servers [puppet] - 10https://gerrit.wikimedia.org/r/1299430 [09:19:46] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11998076 (10jijiki) 05In progress→03Resolved [09:20:17] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [09:22:07] (03PS3) 10JavierMonton: stream: mediawiki.user_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) [09:23:43] (03CR) 10Ayounsi: [C:03+1] global_config: inject network devices data for re-use in Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1299426 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [09:24:21] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are 
FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [09:25:39] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: inject network devices data for re-use in Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1299426 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [09:26:28] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [09:27:27] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:28:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - other - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:33:10] 07sre-alert-triage, 06ServiceOps new: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133#11998128 (10MLechvien-WMF) Thanks for the analysis Reuven. @jasmine_ could you take care of fixing the incorrect probing config? [09:34:47] !log Running `mwscript-k8s extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki="commonswiki" --use-jobqueue --poll-sleep=5 --verbose` (after stopping previous scan run) [09:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:09] (03PS1) 10Aklapper: Weekly Phab data for Tech News: Ignore bot authors and access requests [puppet] - 10https://gerrit.wikimedia.org/r/1299434 (https://phabricator.wikimedia.org/T428290) [09:36:48] !log Running `mwscript-k8s extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki="commonswiki" --use-jobqueue --poll-sleep=5 --verbose --last-checked="20260603"` (after stopping previous scan run) [09:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:34] (03PS2) 10Hnowlan: logging: use ECS formatter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1298894 (https://phabricator.wikimedia.org/T368180) [09:41:57] !log btullis@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:43:15] (03CR) 10Atsuko: admin_ng/dse-k8s: create opensearch ClusterIssuer (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [09:44:16] (03PS6) 10Atsuko: admin_ng/dse-k8s: create opensearch ClusterIssuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) [09:44:23] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Add routing for liftwing-openapi-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298819 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [09:45:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:46:33] elukey: thanks for the head up! [09:46:38] (03Merged) 10jenkins-bot: rest-gateway: Add routing for liftwing-openapi-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298819 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [09:47:51] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:48:01] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - other - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:48:20] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool es1043: Upgrading es1043.eqiad.wmnet [09:49:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS trixie [09:49:48] (03PS1) 10Kamila Součková: shellbox: revert score image version to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299436 (https://phabricator.wikimedia.org/T427820) [09:50:13] !log cgoubert@deploy1003 helmfile [codfw] START 
helmfile.d/services/rest-gateway: apply [09:50:31] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:51:20] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:51:37] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:52:51] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11998210 (10ABran-WMF) == **TL;DR** (generated) == - **Root cause:** when logged in, Postorius' `ListSummaryView` fet... [09:53:04] (03CR) 10Hashar: "recheck after having deployed the CI image and switched the job to it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298799 (https://phabricator.wikimedia.org/T427069) (owner: 10JMeybohm) [09:53:31] (03CR) 10Cathal Mooney: [C:03+2] Rancid: add config backup for missing leaf lsw1-d3-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1298871 (owner: 10Cathal Mooney) [09:53:51] (03CR) 10Kamila Součková: [C:03+2] shellbox: revert score image version to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299436 (https://phabricator.wikimedia.org/T427820) (owner: 10Kamila Součková) [09:54:13] (03PS1) 10JavierMonton: topic: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299437 (https://phabricator.wikimedia.org/T427925) [09:55:14] (03CR) 10Reedy: [C:04-2] "not needed, I already fixed score?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299436 (https://phabricator.wikimedia.org/T427820) (owner: 10Kamila Součková) [09:57:57] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1160: Repooling [09:57:58] (03CR) 10Gmodena: dse-k8s-services: WDQS deployment helmfile values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:58:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1298446 (owner: 10Slyngshede) [09:58:58] (03CR) 10JavierMonton: [C:03+2] topic: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299437 (https://phabricator.wikimedia.org/T427925) (owner: 10JavierMonton) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1000) [10:01:10] (03CR) 10Gmodena: [C:03+1] wdqs-backend: Deployment chart for the WDQS triple-store (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [10:01:22] (03Merged) 10jenkins-bot: topic: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299437 (https://phabricator.wikimedia.org/T427925) (owner: 10JavierMonton) [10:01:22] (03PS1) 10Effie Mouzeli: site.pp: switch rdb2007-2010 to inactive [puppet] - 10https://gerrit.wikimedia.org/r/1299438 (https://phabricator.wikimedia.org/T428561) [10:03:42] (03PS2) 10Muehlenhoff: Update access medadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 [10:04:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:04:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:04:13] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:04:15] !log 
javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:04:23] (03CR) 10Hnowlan: "Sample output:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1298894 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [10:04:40] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:04:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage [10:04:49] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:06:36] (03PS1) 10Clément Goubert: rest-gateway: fix lw-openapi-server host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299439 (https://phabricator.wikimedia.org/T427902) [10:06:52] (03CR) 10CI reject: [V:04-1] rest-gateway: fix lw-openapi-server host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299439 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [10:07:20] (03PS2) 10Clément Goubert: rest-gateway: fix lw-openapi-server host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299439 (https://phabricator.wikimedia.org/T427902) [10:08:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage [10:08:34] (03CR) 10Slyngshede: [C:03+2] data.yaml: offboarding hmonroy [puppet] - 10https://gerrit.wikimedia.org/r/1298446 (owner: 10Slyngshede) [10:08:45] (03CR) 10Trueg: dse-k8s-services: WDQS deployment helmfile values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [10:09:02] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:09:02] PROBLEM - HTTPS non-canonical-redirect-37 on ncredir3006 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:09:04] PROBLEM - HTTPS non-canonical-redirect-14 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:09:04] PROBLEM - HTTPS non-canonical-redirect-25 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:09:04] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:09:57] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: fix lw-openapi-server host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299439 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [10:10:02] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir3006 is OK: SSL OK - Certificate *.wikipedia.bg valid until 2026-07-31 22:40:27 +0000 (expires in 52 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:10:02] RECOVERY - HTTPS non-canonical-redirect-37 on ncredir3006 is OK: SSL OK - Certificate wikipedia.support valid until 2026-08-16 17:25:14 +0000 (expires in 68 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:10:04] RECOVERY - HTTPS non-canonical-redirect-14 on ncredir3006 is OK: SSL OK - Certificate indwikipedia.in valid until 2026-08-26 03:18:04 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:10:04] RECOVERY - HTTPS non-canonical-redirect-25 on ncredir3006 is OK: SSL OK - Certificate wikilambda.net valid until 2026-07-17 16:56:54 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:11:02] RECOVERY - HTTPS non-canonical-redirect-9 on ncredir3006 is OK: SSL OK - Certificate wikipediashop.com valid until 2026-07-17 15:56:54 +0000 (expires in 38 days) 
https://wikitech.wikimedia.org/wiki/Ncredir [10:11:31] (03PS1) 10Cathal Mooney: Add wikikube-ctrl2004 and wikikube-ctrl2005 to codfw K8S NS entries [dns] - 10https://gerrit.wikimedia.org/r/1299440 [10:12:36] (03PS21) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [10:12:38] (03Merged) 10jenkins-bot: rest-gateway: fix lw-openapi-server host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299439 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [10:12:51] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:12:54] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:12:56] (03PS2) 10Cathal Mooney: Add wikikube-ctrl2004 and wikikube-ctrl2005 to codfw K8S NS entries [dns] - 10https://gerrit.wikimedia.org/r/1299440 [10:13:06] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:13:14] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:13:20] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:13:32] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:13:36] (03PS1) 10Hnowlan: images._error: write message to body in error case [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1299441 (https://phabricator.wikimedia.org/T417577) [10:13:39] (03PS3) 10Slyngshede: Update access metadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 (owner: 10Muehlenhoff) [10:13:42] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:13:46] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:14:44] (03CR) 10CI reject: [V:04-1] wdqs-backend: Deployment chart for the WDQS triple-store 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [10:15:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:16:02] (03CR) 10Slyngshede: [C:03+1] Update access metadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 (owner: 10Muehlenhoff) [10:16:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:16:10] (03PS1) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) [10:17:02] !log complete rollout of apache2 upgrades [10:17:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:17:25] (03PS2) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) [10:17:27] (03PS1) 10Muehlenhoff: kafka::broker: Stop using the transition package [puppet] - 10https://gerrit.wikimedia.org/r/1299443 [10:17:40] (03CR) 10Muehlenhoff: [C:03+2] Update access metadata for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1299429 (owner: 10Muehlenhoff) [10:18:48] (03CR) 10Gmodena: [C:03+1] dse-k8s-services: WDQS deployment helmfile values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [10:19:44] (03CR) 10Elukey: [C:03+1] kafka::broker: Stop using the transition package [puppet] - 10https://gerrit.wikimedia.org/r/1299443 
(owner: 10Muehlenhoff) [10:19:46] (03PS10) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [10:20:22] (03Abandoned) 10Kamila Součková: shellbox: revert score image version to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299436 (https://phabricator.wikimedia.org/T427820) (owner: 10Kamila Součková) [10:20:26] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:34] (03PS8) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [10:21:54] (03PS2) 10Effie Mouzeli: site.pp: switch rdb2007-2010 to inactive [puppet] - 10https://gerrit.wikimedia.org/r/1299438 (https://phabricator.wikimedia.org/T428561) [10:22:20] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299438 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [10:24:16] (03PS22) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [10:25:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[10:25:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:26:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1043.eqiad.wmnet with OS trixie [10:26:45] (03PS1) 10Effie Mouzeli: mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) [10:26:52] (03CR) 10Muehlenhoff: [C:03+1] data.yaml: offboarding dmaza [puppet] - 10https://gerrit.wikimedia.org/r/1298450 (owner: 10Slyngshede) [10:27:25] (03CR) 10Brouberol: "The CI does not display the "real" diff, due to the netbox data not being propagated in CI. I ran `helmfile diff` in prod, and the diff is" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [10:28:02] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:28:04] PROBLEM - HTTPS non-canonical-redirect-25 on ncredir3006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [10:28:18] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1160: Repooling [10:28:38] (03CR) 10Cathal Mooney: "Nice work! Thanks for the rapid turnaround on this, I'm quite impressed! Two minor nits in line, tbh I'm out of my depth so perhaps they" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [10:28:51] (03CR) 10Btullis: [C:03+1] "Looks good to me. 
Feel free to merge." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [10:29:02] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir3006 is OK: SSL OK - Certificate *.wikispecies.net valid until 2026-07-21 02:58:22 +0000 (expires in 41 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:29:04] RECOVERY - HTTPS non-canonical-redirect-25 on ncredir3006 is OK: SSL OK - Certificate wikilambda.net valid until 2026-07-17 16:56:54 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/Ncredir [10:29:15] (03PS1) 10Clément Goubert: vopsbot: Invite sirenbot to mw_security [puppet] - 10https://gerrit.wikimedia.org/r/1299449 [10:29:16] (03PS2) 10Slyngshede: data.yaml: offboarding dmaza [puppet] - 10https://gerrit.wikimedia.org/r/1298450 [10:29:53] (03CR) 10Atsuko: [C:03+1] "+1! Thanks for attaching the test plan!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [10:29:58] (03CR) 10Gmodena: [C:03+1] wdqs-backend: Deployment chart for the WDQS triple-store (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [10:30:19] (03CR) 10MVernon: [C:03+1] "Would be nice to be able to drive sirenbot and cortobot in the same channel :)" [puppet] - 10https://gerrit.wikimedia.org/r/1299449 (owner: 10Clément Goubert) [10:30:28] (03CR) 10Clément Goubert: [C:03+2] vopsbot: Invite sirenbot to mw_security [puppet] - 10https://gerrit.wikimedia.org/r/1299449 (owner: 10Clément Goubert) [10:30:38] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1201 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:30:40] ACKNOWLEDGEMENT - Dell 
PowerEdge or Supermicro Broadcom RAID Controller on an-worker1201 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T428571 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:30:47] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571 (10ops-monitoring-bot) 03NEW [10:30:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [10:30:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [10:31:08] (03CR) 10Slyngshede: [C:03+2] data.yaml: offboarding dmaza [puppet] - 10https://gerrit.wikimedia.org/r/1298450 (owner: 10Slyngshede) [10:31:40] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [10:32:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1043: repool after upgrade [10:33:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#11998560 (10Jclark-ctr) a:03Jclark-ctr [10:34:39] (03PS1) 10Arthur taylor: WikiProjects links - add statement-based link to project on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) [10:35:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:35:08] (03PS2) 10Arthur taylor: WikiProjects links - add statement-based 
link to project on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) [10:35:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:35:37] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:35:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [10:38:26] (03CR) 10Clément Goubert: [C:03+1] site.pp: switch rdb2007-2010 to inactive [puppet] - 10https://gerrit.wikimedia.org/r/1299438 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [10:42:36] (03PS3) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) [10:43:21] (03PS3) 10Tiziano Fogli: slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) [10:43:21] (03PS5) 10Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - 10https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795) [10:43:21] (03PS5) 10Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - 10https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795) [10:43:22] (03PS5) 10Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795) [10:43:23] (03PS5) 10Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - 10https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795) [10:43:55] (03CR) 10CI reject: [V:04-1] slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano 
Fogli) [10:44:11] (03CR) 10JMeybohm: [C:03+1] "CI related change LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [10:44:41] (03CR) 10JMeybohm: [C:03+2] Bump kubeconform checks to 1.34.8, remove 1.23.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298799 (https://phabricator.wikimedia.org/T427069) (owner: 10JMeybohm) [10:45:34] (03PS4) 10Tiziano Fogli: slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) [10:45:34] (03PS6) 10Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - 10https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795) [10:45:34] (03PS6) 10Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - 10https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795) [10:45:35] (03PS6) 10Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795) [10:45:36] (03PS6) 10Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - 10https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795) [10:46:12] (03CR) 10CI reject: [V:04-1] slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [10:46:13] (03CR) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [10:47:07] (03CR) 10Ayounsi: "Great, thanks a lot !!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [10:47:11] (03PS5) 10Tiziano Fogli: slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) [10:47:11] (03PS7) 10Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - 10https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795) [10:47:11] (03PS7) 10Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - 10https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795) [10:47:13] (03PS7) 10Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795) [10:47:17] (03PS7) 10Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - 10https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795) [10:48:01] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: switch rdb2007-2010 to inactive [puppet] - 10https://gerrit.wikimedia.org/r/1299438 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [10:49:03] (03CR) 10Effie Mouzeli: [C:03+2] aliases: update all rdb* entries with the new servers [puppet] - 10https://gerrit.wikimedia.org/r/1299430 (owner: 10Effie Mouzeli) [10:49:40] (03CR) 10Atsuko: [C:03+2] admin_ng/dse-k8s: create opensearch ClusterIssuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [10:51:04] (03CR) 10Clément Goubert: "Present in `helmfile.d/services/mw-debug/values.yaml` as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [10:51:41] (03PS2) 10Effie Mouzeli: alias.yaml: retire the old codfw redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1297124 
(https://phabricator.wikimedia.org/T419976) [10:51:58] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:52:01] (03Abandoned) 10Effie Mouzeli: alias.yaml: retire the old codfw redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1297124 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:52:05] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:52:08] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:52:24] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:52:32] (03PS2) 10Muehlenhoff: Pick a new canary [puppet] - 10https://gerrit.wikimedia.org/r/1299313 [10:52:38] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:52:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:52:47] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:53:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2046: Upgrading es2046.codfw.wmnet [10:53:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2046: Upgrading es2046.codfw.wmnet [10:55:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2046.codfw.wmnet with OS trixie [10:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:02:55] (03PS1) 10Gkyziridis: wgRestSandboxSpecs: Add lift-wing spec pointing to api.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) [11:04:31] FIRING: [5x] RedisReplicaDown: Redis replica down rdb2010:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [11:06:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:06:25] (03PS1) 10Effie Mouzeli: site.pp: reimage rdb1015 and rdb1016 as redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1299455 (https://phabricator.wikimedia.org/T418918) [11:08:10] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11998679 (10jijiki) [11:08:46] (03PS6) 10Tiziano Fogli: slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 
(https://phabricator.wikimedia.org/T425795) [11:08:46] (03PS8) 10Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - 10https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795) [11:08:46] (03PS8) 10Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - 10https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795) [11:08:47] (03PS8) 10Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795) [11:08:48] (03PS8) 10Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - 10https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795) [11:09:14] (03CR) 10Tiziano Fogli: slothslos/report2drive: add modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [11:09:31] FIRING: [10x] RedisReplicaDown: Redis replica down rdb2008:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [11:10:49] (03PS1) 10Muehlenhoff: Remove access for ksiebert [puppet] - 10https://gerrit.wikimedia.org/r/1299456 [11:11:40] (03CR) 10CI reject: [V:04-1] Remove access for ksiebert [puppet] - 10https://gerrit.wikimedia.org/r/1299456 (owner: 10Muehlenhoff) [11:11:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2046.codfw.wmnet with reason: host reimage [11:12:06] (03PS2) 10Effie Mouzeli: site.pp: reimage rdb1015 and rdb1016 as redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1299455 (https://phabricator.wikimedia.org/T418918) [11:12:24] (03PS3) 10Effie Mouzeli: site.pp: reimage rdb1015 and rdb1016 as redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1299455 (https://phabricator.wikimedia.org/T418918) [11:13:06] (03CR) 10Effie Mouzeli: [C:03+2] Pick a new canary 
[puppet] - 10https://gerrit.wikimedia.org/r/1299313 (owner: 10Muehlenhoff) [11:13:19] (03PS2) 10Muehlenhoff: Remove access for ksiebert [puppet] - 10https://gerrit.wikimedia.org/r/1299456 [11:15:53] (03Merged) 10jenkins-bot: Bump kubeconform checks to 1.34.8, remove 1.23.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298799 (https://phabricator.wikimedia.org/T427069) (owner: 10JMeybohm) [11:17:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1043: repool after upgrade [11:17:34] (03CR) 10Muehlenhoff: [C:03+1] "Smoketest von 2005 and idp-test.w.o was all fine" [dns] - 10https://gerrit.wikimedia.org/r/1299416 (owner: 10Slyngshede) [11:18:25] (03PS1) 10Slyngshede: Permissions: Approvals are incorrectly compared [software/bitu] - 10https://gerrit.wikimedia.org/r/1299457 [11:18:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2046.codfw.wmnet with reason: host reimage [11:18:54] (03PS1) 10Effie Mouzeli: changeprop: switch to rdb1015 (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299458 (https://phabricator.wikimedia.org/T418918) [11:19:29] (03CR) 10Slyngshede: [C:03+1] Remove access for ksiebert [puppet] - 10https://gerrit.wikimedia.org/r/1299456 (owner: 10Muehlenhoff) [11:19:31] FIRING: [10x] RedisReplicaDown: Redis replica down rdb2008:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [11:19:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037#11998718 (10Raine) Key verified OOB. [11:20:04] (03CR) 10Kamila Součková: [C:03+2] "Yes, verified." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297191 (https://phabricator.wikimedia.org/T428037) (owner: 10Kamila Součková) [11:20:11] (03Merged) 10jenkins-bot: admin_ng/dse-k8s: create opensearch ClusterIssuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: 10Atsuko) [11:20:26] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:29] (03PS1) 10Effie Mouzeli: redioscope: switch to rdb1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299459 (https://phabricator.wikimedia.org/T418918) [11:23:32] (03CR) 10Mahmoud-abdelsattar: [C:03+1] WikiProjects links - add statement-based link to project on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) (owner: 10Arthur taylor) [11:23:37] (03CR) 10Slyngshede: [C:03+2] IDP: Upgrade to CAS 7.3.7.2 [dns] - 10https://gerrit.wikimedia.org/r/1299416 (owner: 10Slyngshede) [11:23:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#11998741 (10Jclark-ctr) Dell SR227538194 @BTullis we have 2 failed drives 1 listed as foreign 1 failed Physical Disk 0:1:3 Physical Disk 0:1:7 Can you assist with replacements [11:24:13] (03PS1) 10Effie Mouzeli: changeprop: switch to rdb1015 (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) [11:24:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#11998744 (10Jclark-ctr) [11:24:28] (03CR) 10Muehlenhoff: [C:03+2] Remove access for ksiebert [puppet] - 10https://gerrit.wikimedia.org/r/1299456 (owner: 10Muehlenhoff) [11:24:31] 
RESOLVED: [10x] RedisReplicaDown: Redis replica down rdb2008:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [11:24:40] !log slyngshede@dns1004 START - running authdns-update [11:25:26] (03PS1) 10Effie Mouzeli: changeprop-jobqueue: switch to rdb1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) [11:26:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037#11998778 (10Raine) 05Open→03Resolved a:03Raine [11:26:16] !log slyngshede@dns1004 END - running authdns-update [11:26:53] !log CAS-SSO upgrade to version 3.7.3.2 [11:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:03] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging KSiebert out of all services on: 2435 hosts [11:30:44] (03CR) 10Blake: [C:03+1] site.pp: reimage rdb1015 and rdb1016 as redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1299455 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [11:30:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) (owner: 10Arthur taylor) [11:31:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299443 (owner: 10Muehlenhoff) [11:31:46] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging HMonroy out of all services on: 2435 hosts [11:31:46] (03CR) 10Effie Mouzeli: "I am waiting to reimage rdb1015 and rdb1016 (and get new IPs), so to update this file with one go. E_TOO_MANY_COMMITS." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [11:33:21] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: reimage rdb1015 and rdb1016 as redis servers [puppet] - 10https://gerrit.wikimedia.org/r/1299455 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [11:34:00] (03CR) 10Clément Goubert: [C:03+1] mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [11:35:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2185.codfw.wmnet with reason: Reimage [11:36:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2046.codfw.wmnet with OS trixie [11:36:45] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:37:38] (03PS1) 10Effie Mouzeli: ratelimit: switch to rdb1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299464 (https://phabricator.wikimedia.org/T418918) [11:38:15] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:38:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2046: repool after maintenance [11:39:10] (03CR) 10Trueg: [C:03+2] wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [11:39:53] fceratto@cumin1003 major-upgrade (PID 2327519) is awaiting input [11:39:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11998888 (10Jclark-ctr) @Marostegui if you want to leave ticket open at least till Dell responds. But server is back up right now after updating firmwares Same issue that was documented in T398794 tic... 
[11:41:31] (03Merged) 10jenkins-bot: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [11:43:07] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:43:41] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:44:22] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [11:45:20] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:45:48] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11998911 (10Jclark-ctr) 05Open→03Resolved Closing this ticket Opened Decom ticket T428582 for Data Platform [11:46:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11998915 (10Marostegui) @Jclark-ctr yeah - let's leave it open. The good thing is that the server still has a couple of month of warranty if we can get something replaced by dell. [11:46:50] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T428361#11998918 (10Jclark-ctr) 05Open→03Resolved no Faults for the last 2 days resolving ticket [11:47:25] !log atsuko@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:47:35] (03PS1) 10Effie Mouzeli: rest-gateway: switch to rdb1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299466 (https://phabricator.wikimedia.org/T418918) [11:47:49] (03PS10) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [11:48:29] !log atsuko@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:48:50] (03PS1) 10Effie Mouzeli: docker-registry: switch to rdb1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299467 (https://phabricator.wikimedia.org/T418918) [11:49:08] !log atsuko@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:50:24] marostegui@cumin1003 reimage (PID 2334380) is awaiting input [11:51:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS trixie [11:51:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11998929 (10Jclark-ctr) Yeah, unfortunately we haven't had any luck obtaining replacement parts. Some of the earlier failures may not have had Dell support tickets submitted when this issue first occurred,... [11:51:18] !log atsuko@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:52:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11998933 (10Marostegui) Ok, I am doing a reimage and will put it back in production once it finishes. Let's see if Dell says something before closing the ticket if that's ok with you. Thank you! 
[11:53:35] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11998935 (10jijiki) [11:54:06] (03PS1) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php to rdb2015:6381 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299468 (https://phabricator.wikimedia.org/T418918) [11:56:04] (03PS1) 10Muehlenhoff: Update access metadata for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1299469 [11:56:31] (03CR) 10Dreamy Jazz: wmf-config: Enable hCaptcha on UploadWizard publish for testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [11:56:34] (03CR) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [11:56:51] jouncebot: nowandnext [11:56:52] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [11:56:52] In 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1200) [11:57:57] (03PS2) 10Effie Mouzeli: mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) [11:58:18] (03PS4) 10Dreamy Jazz: wmf-config: Enable hCaptcha on UploadWizard publish for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [11:58:40] (03PS2) 10Effie Mouzeli: mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) [11:58:50] (03CR) 10Dreamy Jazz: wmf-config: Enable hCaptcha on UploadWizard publish for testwiki (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [11:59:00] (03CR) 10Dreamy Jazz: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [11:59:51] (03PS1) 10Effie Mouzeli: mediawiki-common: add rdb1015 rdb1016 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1200) [12:00:39] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dmaza out of all services on: 2435 hosts [12:00:43] (03PS4) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) [12:00:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [12:01:08] (03CR) 10Brouberol: turnilo: inject the wmf_netflow mappings from netbox data (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [12:03:26] (03CR) 10CI reject: [V:04-1] mediawiki-common: add rdb1015 rdb1016 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [12:04:54] (03CR) 10Slyngshede: [C:03+1] Update access metadata for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1299469 (owner: 10Muehlenhoff) [12:05:06] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage [12:05:47] (03Merged) 10jenkins-bot: wmf-config: Enable hCaptcha on UploadWizard publish for testwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298829 (https://phabricator.wikimedia.org/T426126) (owner: 10Mpostoronca) [12:07:27] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1298829|wmf-config: Enable hCaptcha on UploadWizard publish for testwiki (T426126)]] [12:07:31] T426126: Implement CAPTCHA support in the Upload Wizard - https://phabricator.wikimedia.org/T426126 [12:07:32] (03CR) 10Muehlenhoff: [C:03+2] Update access metadata for dtotten-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1299469 (owner: 10Muehlenhoff) [12:07:58] (03PS1) 10Ayounsi: Add rack depool strategy for role aux_k8s::worker [puppet] - 10https://gerrit.wikimedia.org/r/1299471 (https://phabricator.wikimedia.org/T327300) [12:08:37] (03PS1) 10Kevin Bazira: aptrepo: add ROCm 7.2 packages to wikimedia bookworm mirror [puppet] - 10https://gerrit.wikimedia.org/r/1299472 (https://phabricator.wikimedia.org/T428577) [12:10:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage [12:12:16] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11999051 (10Ladsgroup) Thanks for the thorough investigation! I want to clean them up until it becomes automated but th... [12:12:24] Testserver deployment is going very slowly [12:13:28] !log dreamyjazz@deploy1003 mpostoronca, dreamyjazz: Backport for [[gerrit:1298829|wmf-config: Enable hCaptcha on UploadWizard publish for testwiki (T426126)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:13:32] T426126: Implement CAPTCHA support in the Upload Wizard - https://phabricator.wikimedia.org/T426126 [12:14:01] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A4 [12:14:44] (03CR) 10Ayounsi: [C:03+1] "nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [12:15:23] !log drain traffic on ssw1-a1-codfw - add gshut community in evpn underlay - T427357 [12:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:31] T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 [12:16:57] !log dreamyjazz@deploy1003 mpostoronca, dreamyjazz: Continuing with deployment [12:16:58] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 24 hosts with reason: Rack A4 depool [12:17:12] (03CR) 10Brouberol: [C:03+2] turnilo: inject the wmf_netflow mappings from netbox data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299442 (https://phabricator.wikimedia.org/T428553) (owner: 10Brouberol) [12:17:18] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::kubeadm::etcd: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1295905 (https://phabricator.wikimedia.org/T427799) (owner: 10Majavah) [12:17:18] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.depool-rack (exit_code=99) with action 'depool' for codfw rack A4 [12:17:54] (03PS1) 10Giuseppe Lavagetto: hiddenparma: switch to native CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/1299475 (https://phabricator.wikimedia.org/T422235) [12:17:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298654 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [12:19:13] !log 
jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1015.eqiad.wmnet with OS trixie [12:19:24] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11999065 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1003 for host rdb1015.eqiad.wmnet with OS... [12:19:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1016.eqiad.wmnet with OS trixie [12:19:34] !log jiji@cumin1003 START - Cookbook sre.hosts.move-vlan for host rdb1015 [12:19:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host rdb1015 [12:19:45] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11999066 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1003 for host rdb1016.eqiad.wmnet with OS... [12:19:51] !log jiji@cumin1003 START - Cookbook sre.hosts.move-vlan for host rdb1016 [12:19:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host rdb1016 [12:19:58] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11999067 (10ABran-WMF) Thanks @Ladsgroup, per our chat on IRC I'll run the [[ https://gerrit.wikimedia.org/r/c/operatio... 
[12:20:01] !log ayounsi@cumin1003 START - Cookbook sre.mysql.depool depool db2241: rack depool [12:20:01] (03CR) 10CI reject: [V:04-1] hiddenparma: switch to native CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/1299475 (https://phabricator.wikimedia.org/T422235) (owner: 10Giuseppe Lavagetto) [12:20:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [12:20:19] (03PS11) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [12:20:21] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2241: rack depool [12:20:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [12:20:46] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2251-2253].codfw.wmnet [12:22:00] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2005.codfw.wmnet [12:22:32] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2251-2253].codfw.wmnet [12:23:06] (03PS2) 10Giuseppe Lavagetto: hiddenparma: switch to native CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/1299475 (https://phabricator.wikimedia.org/T422235) [12:23:27] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-worker2006.codfw.wmnet [12:23:31] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298829|wmf-config: Enable hCaptcha on UploadWizard publish for testwiki (T426126)]] (duration: 16m 04s) [12:23:35] T426126: Implement CAPTCHA support in the Upload Wizard - https://phabricator.wikimedia.org/T426126 [12:23:42] (03PS1) 10Dbrant: hCaptcha: Roll out to all wikis for api account creation. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299477 (https://phabricator.wikimedia.org/T426050) [12:24:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-worker2006.codfw.wmnet [12:24:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2046: repool after maintenance [12:24:52] (03CR) 10Dpogorzelski: [C:03+2] aptrepo: add ROCm 7.2 packages to wikimedia bookworm mirror [puppet] - 10https://gerrit.wikimedia.org/r/1299472 (https://phabricator.wikimedia.org/T428577) (owner: 10Kevin Bazira) [12:25:14] (03CR) 10CI reject: [V:04-1] hiddenparma: switch to native CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/1299475 (https://phabricator.wikimedia.org/T422235) (owner: 10Giuseppe Lavagetto) [12:25:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299477 (https://phabricator.wikimedia.org/T426050) (owner: 10Dbrant) [12:27:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2005.codfw.wmnet [12:29:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [12:29:08] (03PS1) 10Dreamy Jazz: STVFormatter: Cast strings to float before passing to round [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299478 (https://phabricator.wikimedia.org/T428584) [12:29:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [12:29:17] fceratto@cumin1003 major-upgrade (PID 2327519) is awaiting input [12:30:04] Going to use scap again [12:31:10] (03PS13) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) 
[12:31:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:31:53] (03PS2) 10Effie Mouzeli: docker-registry: switch to rdb1015 #3 [puppet] - 10https://gerrit.wikimedia.org/r/1299467 (https://phabricator.wikimedia.org/T418918) [12:32:05] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1016.eqiad.wmnet with reason: host reimage [12:32:20] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1015.eqiad.wmnet with reason: host reimage [12:32:31] (03CR) 10Kosta Harlan: [C:03+1] "LGTM! I'll follow up later with something to remove `wmgEnableHCaptchaAccountCreationAPI` but this is fine for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299477 (https://phabricator.wikimedia.org/T426050) (owner: 10Dbrant) [12:32:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299478 (https://phabricator.wikimedia.org/T428584) (owner: 10Dreamy Jazz) [12:32:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1237.eqiad.wmnet with OS trixie [12:33:07] (03CR) 10Ladsgroup: [C:03+2] logging: use ECS formatter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1298894 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [12:33:13] (03PS24) 10Ayounsi: Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [12:33:15] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:33:18] (03CR) 10Ladsgroup: [C:03+2] images._error: write message to body in error case 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1299441 (https://phabricator.wikimedia.org/T417577) (owner: 10Hnowlan) [12:33:31] (03Abandoned) 10Effie Mouzeli: changeprop: switch to rdb1015 (staging) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299458 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [12:33:34] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1184: Upgrading db1184.eqiad.wmnet [12:33:54] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1184: Upgrading db1184.eqiad.wmnet [12:34:09] (03Merged) 10jenkins-bot: STVFormatter: Cast strings to float before passing to round [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299478 (https://phabricator.wikimedia.org/T428584) (owner: 10Dreamy Jazz) [12:34:10] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:34:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2153: Upgrading db2153.codfw.wmnet [12:34:38] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1299478|STVFormatter: Cast strings to float before passing to round (T428584)]] [12:34:46] (03CR) 10Btullis: [C:03+2] Update the k8s deployment tokens for wdqs namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1298308 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [12:34:46] T428584: TypeError when tallying STV elections - https://phabricator.wikimedia.org/T428584 [12:34:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1237: repool after maintenance db1237 [12:34:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2153: Upgrading db2153.codfw.wmnet [12:35:07] (03PS1) 10Marostegui: Revert "db1237: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1299479 [12:35:13] (03PS2) 10Effie Mouzeli: changeprop: switch to rdb1015 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) [12:35:53] (03CR) 10Ayounsi: [C:03+2] Add rack depool strategy for role aux_k8s::worker [puppet] - 10https://gerrit.wikimedia.org/r/1299471 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [12:36:05] (03CR) 10Ayounsi: [C:03+2] "Self merge as NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1299471 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [12:36:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:36:25] (03PS3) 10Effie Mouzeli: changeprop: switch to rdb1015 #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) [12:36:39] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1299478|STVFormatter: Cast strings to float before passing to round (T428584)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:36:44] (03PS2) 10Effie Mouzeli: changeprop-jobqueue: switch to rdb1015 #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) [12:36:47] Dreamy_Jazz: I’ll sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1299477 when you’re done [12:37:20] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [12:37:30] Sure, I'll ping you when I'm done [12:37:36] (03PS14) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [12:37:53] cwilliams@cumin1003 major-upgrade (PID 2369121) is awaiting input [12:37:54] (03PS2) 10Effie Mouzeli: ratelimit: switch to rdb1015 #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299464 (https://phabricator.wikimedia.org/T418918) [12:38:22] (03CR) 10Elukey: "@jhathaway@wikimedia.org I like it, I fixed tests and tuned a little more the code. Lemme know if you like it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [12:38:23] (03PS2) 10Effie Mouzeli: redioscope: switch to rdb1015 #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299459 (https://phabricator.wikimedia.org/T418918) [12:38:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1184.eqiad.wmnet with OS trixie [12:38:35] (03PS2) 10Effie Mouzeli: rest-gateway: switch to rdb1015 #8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299466 (https://phabricator.wikimedia.org/T418918) [12:39:34] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:14] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2153.codfw.wmnet with OS trixie [12:40:29] !log installing wireshark security updates [12:40:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1016.eqiad.wmnet with reason: host reimage [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:57] (03CR) 10Marostegui: [C:03+2] Revert "db1237: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1299479 (owner: 10Marostegui) [12:41:01] jouncebot: nowandnext [12:41:01] For the next 0 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1200) [12:41:01] In 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1300) [12:41:40] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299478|STVFormatter: Cast strings to float before passing to round 
(T428584)]] (duration: 07m 02s) [12:41:41] (03Merged) 10jenkins-bot: logging: use ECS formatter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1298894 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [12:41:44] T428584: TypeError when tallying STV elections - https://phabricator.wikimedia.org/T428584 [12:41:50] kostajh: over to you [12:41:55] Thanks [12:42:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299477 (https://phabricator.wikimedia.org/T426050) (owner: 10Dbrant) [12:42:14] (03Merged) 10jenkins-bot: images._error: write message to body in error case [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1299441 (https://phabricator.wikimedia.org/T417577) (owner: 10Hnowlan) [12:42:43] !log increase OSPF cost on ssw1-a1-codfw link to lsw1-a4-codfw to force traffic via alternate spine T427357 [12:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:47] T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 [12:43:02] (03Merged) 10jenkins-bot: hCaptcha: Roll out to all wikis for api account creation. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299477 (https://phabricator.wikimedia.org/T426050) (owner: 10Dbrant) [12:43:26] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1299477|hCaptcha: Roll out to all wikis for api account creation. 
(T426050)]] [12:43:30] T426050: Roll out hCaptcha for use on app clients for enwiki - https://phabricator.wikimedia.org/T426050 [12:45:08] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A4 [12:45:18] !log shut sub-interfaces for row A/B legacy vlans on cr1-codfw T427357 [12:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:28] !log kharlan@deploy1003 kharlan, dbrant: Backport for [[gerrit:1299477|hCaptcha: Roll out to all wikis for api account creation. (T426050)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:45:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1015.eqiad.wmnet with reason: host reimage [12:46:08] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.depool-rack (exit_code=99) with action 'depool' for codfw rack A4 [12:46:28] !log kharlan@deploy1003 kharlan, dbrant: Continuing with deployment [12:47:25] (03CR) 10Brouberol: [C:03+2] dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [12:47:29] (03CR) 10Brouberol: [C:03+2] dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [12:47:37] (03CR) 10CI reject: [V:04-1] dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [12:47:38] (03CR) 10CI reject: [V:04-1] dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [12:50:15] (03CR) 10Ottomata: stream: mediawiki.user_change (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) (owner: 10JavierMonton) [12:50:32] (03CR) 10CDanis: [C:03+1] Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 (owner: 10Ayounsi) [12:50:47] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299477|hCaptcha: Roll out to all wikis for api account creation. (T426050)]] (duration: 07m 21s) [12:50:51] T426050: Roll out hCaptcha for use on app clients for enwiki - https://phabricator.wikimedia.org/T426050 [12:50:59] (03PS1) 10Santiago Faci: Test Kitchen UI: Setting log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299485 [12:51:25] (03PS3) 10Brouberol: dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) [12:51:25] (03PS3) 10Brouberol: dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) [12:53:41] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage [12:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - lsw1-a4-codfw:et-0/0/55 (Core: ssw1-a1-codfw:et-0/0/3 {#230403800031}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-a4-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:55:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:40] !log fceratto@cumin1003 END 
(PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:56:08] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299491 [12:56:46] !log lsw1-a4-codfw> request system reboot - T427357 [12:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:50] T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 [12:57:08] !log ayounsi@cumin1003 START - Cookbook sre.network.depool-rack with action 'depool' for codfw rack A4 [12:57:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1016.eqiad.wmnet with OS trixie [12:57:44] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11999296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1003 for host rdb1016.eqiad.wmnet with OS trix... [12:58:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [12:58:36] (03CR) 10Brouberol: [C:03+2] dse-k8s: remove kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298265 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [12:59:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage [12:59:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - lsw1-a4-codfw:et-0/0/55 (Core: ssw1-a1-codfw:et-0/0/3 {#230403800031}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-a4-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1300). [13:00:04] manfredi, cscott, and Neriah: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:13] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:00:17] Hi, I am here [13:00:24] h [13:00:26] i [13:00:28] (03PS1) 10Dreamy Jazz: STVFormatter: Cast strings to float before passing to round [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299495 (https://phabricator.wikimedia.org/T428584) [13:00:32] ayounsi@cumin1003 depool-rack (PID 2377393) is awaiting input [13:00:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-a8-codfw and lsw1-a4-codfw (10.192.252.6) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a8-codfw:9804&var-bgp_group=EVPN_IBGP&var-bgp_neighbor=lsw1-a4-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:00:41] PROBLEM - Host lsw1-a4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:00:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/3 (Core: lsw1-a4-codfw:et-0/0/54 {#230403800024}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-a8-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:00:56] o/ [13:01:06] o/ [13:01:23] 
(03CR) 10Brouberol: [V:03+2 C:03+2] dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:01:42] (03CR) 10Brouberol: [V:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:02:05] PROBLEM - Host lsw1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:02:05] PROBLEM - Host lsw1-a4-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:02:18] I can deploy, let’s start with Neriah [13:02:37] 🔥 [13:02:57] (03CR) 10Bearloga: [C:03+1] Deploy GrowthBook 4.4.0 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298711 (https://phabricator.wikimedia.org/T427506) (owner: 10Santiago Faci) [13:03:02] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1015.eqiad.wmnet with OS trixie [13:03:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298654 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [13:03:15] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11999348 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1003 for host rdb1015.eqiad.wmnet with OS trix... 
[13:03:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-a4-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:04:13] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [13:04:17] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [13:04:18] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:38] (03CR) 10Xcollazo: stream: mediawiki.user_change (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) (owner: 10JavierMonton) [13:04:43] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [13:04:50] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [13:04:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11999398 (10Gehel) [13:04:57] (03PS1) 10C. 
Scott Ananian: Store indicators in ContentHolder: forward compatibility [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) [13:05:34] we deployed a similar change to incubatorwiki yesterday, without any problems [13:05:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) (owner: 10C. Scott Ananian) [13:05:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a4-codfw (10.192.252.6) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:05:42] yes, I remember :) [13:05:43] do I need to do testing this time too? [13:05:46] good to hear there were no problems so far [13:05:51] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - lsw1-a4-codfw:et-0/0/55 (Core: ssw1-a1-codfw:et-0/0/3 {#230403800031}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:05:56] yes, we’d still want some testing on commonswiki [13:06:00] though I’m not sure what testing can be done tbh [13:06:04] what did you test yesterday? 
[13:07:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:07:22] I created an account manually and automatically, and I saw that in both cases a message was sent when it should have been [13:07:32] ah, okay [13:08:10] i don't think there's a better test to do :) [13:08:13] RECOVERY - Host lsw1-a4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms [13:08:21] yeah [13:08:22] RESOLVED: CertAlmostExpired: gNMI TLS certificate for lsw1-a4-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:08:42] but I’m not sure we need to do that kind of test (which leaves permanent traces in the new user log etc.) every time [13:08:53] (03Merged) 10jenkins-bot: Enable wgNewUserMessageOnFirstEdit on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298654 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [13:09:19] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1298654|Enable wgNewUserMessageOnFirstEdit on commonswiki (T426206)]] [13:09:23] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [13:09:39] RECOVERY - Host lsw1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.36 ms [13:09:55] meanwhile, 91% of this job’s time was spent waiting for castor-save-workspace-cache. 
this is getting ridiculous :( https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-test/3236/console [13:10:13] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:13] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:22] agree [13:10:51] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - lsw1-a4-codfw:et-0/0/55 (Core: ssw1-a1-codfw:et-0/0/3 {#230403800031}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:10:52] i already deleted yesterday's accounts :) [13:10:55] i mean vanishing [13:11:20] !log lucaswerkmeister-wmde@deploy1003 neriah, lucaswerkmeister-wmde: Backport for [[gerrit:1298654|Enable wgNewUserMessageOnFirstEdit on commonswiki (T426206)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:28] (03CR) 10Brouberol: [C:03+2] dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:11:44] that doesn’t delete them from the database though ^^ [13:11:52] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.depool-rack (exit_code=99) with action 'depool' for codfw rack A4 [13:12:07] ya [13:12:17] ERRORS: 159 requests attempted to mwdebug.discovery.wmnet. Errors connecting to 1 host. [13:12:18] hm [13:12:19] RECOVERY - Host lsw1-a4-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [13:12:24] : HTTPSConnectionPool(host='mwdebug.discovery.wmnet', port=4444): Read timed out. 
(read timeout=10) [13:12:48] let’s retry the testserver checks [13:13:38] looks like they worked this time but it didn’t ping us about it on IRC [13:14:06] Neriah: do you want to test anything or should I continue with the deployment directly? [13:14:27] I think you can continue [13:14:37] !log lucaswerkmeister-wmde@deploy1003 neriah, lucaswerkmeister-wmde: Continuing with deployment [13:14:39] alright [13:14:50] we have nothing better to do.. [13:15:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a4-codfw (10.192.252.6) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:15:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/3 (Core: lsw1-a4-codfw:et-0/0/55 {#230403800031}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:15:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1184.eqiad.wmnet with OS trixie [13:16:28] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2005.codfw.wmnet [13:16:30] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2005.codfw.wmnet [13:16:40] Lucas_WMDE: I'll wait a few days and if I don't see any problems I'll enable it on the other sites where wgNewUserMessageOnAutoCreate was enabled. [13:16:48] sounds good to me, thanks! 
[13:16:48] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2251-2253].codfw.wmnet [13:16:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2251-2253].codfw.wmnet [13:17:04] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-worker2006.codfw.wmnet [13:17:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-worker2006.codfw.wmnet [13:17:08] (03PS7) 10Tiziano Fogli: slothslos/report2drive: add modules [puppet] - 10https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) [13:17:08] (03PS9) 10Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - 10https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795) [13:17:08] (03PS9) 10Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - 10https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795) [13:17:08] (03PS9) 10Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795) [13:17:09] (03PS9) 10Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - 10https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795) [13:17:48] (after that, I think it will be possible to enable it by default, but it will need to be discussed somewhere.) [13:18:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11999540 (10ABran-WMF) >>! In T353891#11998503, @gerritbot wrote: > Change #1299448 had a related patch set uploaded (b... 
[13:18:47] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299491 (owner: 10Muehlenhoff) [13:18:59] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298654|Enable wgNewUserMessageOnFirstEdit on commonswiki (T426206)]] (duration: 09m 40s) [13:19:03] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [13:19:47] (03Merged) 10jenkins-bot: dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:19:53] (03Merged) 10jenkins-bot: dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:20:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1237: repool after maintenance db1237 [13:20:18] !log ayounsi@cumin1003 START - Cookbook sre.mysql.pool pool db2241: rack depool [13:20:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2153.codfw.wmnet with OS trixie [13:21:18] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [13:21:28] (03CR) 10CI reject: [V:04-1] Store indicators in ContentHolder: forward compatibility [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) (owner: 10C. Scott Ananian) [13:21:57] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:22:05] oh, deploy finished [13:22:08] I got distracted, sorry [13:22:25] I am around [13:22:26] manfredi: you’re next, do you want to deploy your config change yourself? [13:22:55] Lucas_WMDE: I have never done it myself, if you could do it for me I'd appreciate it [13:22:59] ok sure! 
[13:23:22] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:23:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298834 (https://phabricator.wikimedia.org/T428291) (owner: 10Mmartorana) [13:23:43] (https://spiderpig.wikimedia.org/jobs/2219 if you have spiderpig access) [13:24:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:24:40] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#11999578 (10Jclark-ctr) @marostegui. Dell did just respond. no other ticket created for this server. This is the first one created for this issue. For the issue reported we usually update the firmware to... [13:24:56] Lucas_WMDE: I can spiderpig my changes myself when you're done with manfredi [13:25:02] (03Merged) 10jenkins-bot: config: Disable EmailConfirmationBanner on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298834 (https://phabricator.wikimedia.org/T428291) (owner: 10Mmartorana) [13:25:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:25:25] cscott: ack [13:25:28] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1298834|config: Disable EmailConfirmationBanner on all wikis (T428291)]] [13:25:32] T428291: Disable Email Confirmation Banner on all wikis - https://phabricator.wikimedia.org/T428291 [13:26:44] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1184: Migration of db1184.eqiad.wmnet completed [13:26:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11999603 (10ayounsi) [13:27:26] (03PS1) 10Dreamy Jazz: SecurePollLogPager: Cast user IDs to ints before use [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299502 (https://phabricator.wikimedia.org/T428599) [13:27:29] !log lucaswerkmeister-wmde@deploy1003 mmartorana, lucaswerkmeister-wmde: Backport for [[gerrit:1298834|config: Disable EmailConfirmationBanner on all wikis (T428291)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:43] (03PS2) 10Ssingh: hcaptcha: Allow the wikisource.org bare domain in frame-ancestors CSP [puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) (owner: 10Kosta Harlan) [13:27:51] manfredi: anything to test for this change? 
[13:27:58] (on WikimediaDebug, that is) [13:28:05] no you can go ahead [13:28:10] !log lucaswerkmeister-wmde@deploy1003 mmartorana, lucaswerkmeister-wmde: Continuing with deployment [13:28:11] ok [13:28:13] 10ops-codfw, 06SRE, 06DC-Ops: Move test host in codfw rack B3 or D3 - https://phabricator.wikimedia.org/T428041#11999611 (10Jhancock.wm) 05Open→03Resolved decommed [13:28:18] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:28:23] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:28:28] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:28:28] cscott: you can probably already start gate-and-submit for your backports btw [13:28:32] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8678/co" [puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) (owner: 10Kosta Harlan) [13:28:35] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:28:36] (03PS25) 10Ayounsi: Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [13:29:03] lucas can i bundle a wmf.5 and a wmf.6 patch into the same spiderpig session? [13:29:09] i've never tried that before [13:29:20] (03CR) 10Ayounsi: [C:03+2] wmf_netflow: specify data kind (bool/number) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299424 (owner: 10Ayounsi) [13:29:51] cscott: should work, yeah [13:29:55] (03CR) 10Ssingh: [V:03+1 C:03+1] "Let me know when you want to merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) (owner: 10Kosta Harlan) [13:30:42] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2153: Migration of db2153.codfw.wmnet completed [13:31:34] (03Merged) 10jenkins-bot: wmf_netflow: specify data kind (bool/number) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299424 (owner: 10Ayounsi) [13:31:45] (03CR) 10C. Scott Ananian: [C:03+2] "recheck" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) (owner: 10C. Scott Ananian) [13:32:01] (03CR) 10Ayounsi: [C:03+2] Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:32:16] (03CR) 10C. Scott Ananian: [C:03+2] "get a headstart on merge pre-deploy" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298929 (https://phabricator.wikimedia.org/T423700) (owner: 10C. Scott Ananian) [13:32:21] (03CR) 10C. Scott Ananian: [C:03+2] "get a headstart on merge pre-deploy" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298926 (owner: 10C. Scott Ananian) [13:32:26] (03CR) 10C. Scott Ananian: [C:03+2] "get a headstart on merge pre-deploy" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298927 (owner: 10C. Scott Ananian) [13:32:29] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298834|config: Disable EmailConfirmationBanner on all wikis (T428291)]] (duration: 07m 01s) [13:32:33] (03CR) 10C. Scott Ananian: [C:03+2] "get a headstart on merge pre-deploy" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298925 (https://phabricator.wikimedia.org/T428336) (owner: 10C. 
Scott Ananian) [13:32:33] T428291: Disable Email Confirmation Banner on all wikis - https://phabricator.wikimedia.org/T428291 [13:32:47] cscott: over to you [13:32:57] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [13:33:54] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [13:34:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298929 (https://phabricator.wikimedia.org/T423700) (owner: 10C. Scott Ananian) [13:34:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298926 (owner: 10C. Scott Ananian) [13:34:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298927 (owner: 10C. Scott Ananian) [13:34:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298925 (https://phabricator.wikimedia.org/T428336) (owner: 10C. Scott Ananian) [13:34:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) (owner: 10C. Scott Ananian) [13:34:36] Lucas_WMDE: All good with me? thanks a lot! [13:34:50] 10SRE-swift-storage, 10MediaWiki-Uploading: "Could not read file" error during upload - https://phabricator.wikimedia.org/T428315#11999688 (10MatthewVernon) I've gone looking in swift logs for the first of these two objects, and I find these four hits, in time order: ` Jun 5 00:53:39 ms-fe1017 proxy-server: 1... [13:34:54] manfredi: yup, should be done :) [13:35:06] thank you! 
[13:35:09] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [13:35:23] Lucas_WMDE: thanks! [13:36:01] (03Merged) 10jenkins-bot: Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:37:49] (03Merged) 10jenkins-bot: Store indicators in ContentHolder: forward compatibility [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299497 (https://phabricator.wikimedia.org/T427622) (owner: 10C. Scott Ananian) [13:37:56] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [13:37:57] (03Merged) 10jenkins-bot: Simplify fragment processing [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298929 (https://phabricator.wikimedia.org/T423700) (owner: 10C. Scott Ananian) [13:38:04] (03Merged) 10jenkins-bot: Move ::getFragmentsToTransform() to Content{Text,DOM}TransformStage [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298926 (owner: 10C. Scott Ananian) [13:38:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299495 (https://phabricator.wikimedia.org/T428584) (owner: 10Dreamy Jazz) [13:38:10] (03Merged) 10jenkins-bot: OutputTransform: Rename DeduplicateStyles and ExpandToAbsoluteUrls stages [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298927 (owner: 10C. 
Scott Ananian) [13:38:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299502 (https://phabricator.wikimedia.org/T428599) (owner: 10Dreamy Jazz) [13:39:07] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:39:35] (03Merged) 10jenkins-bot: Reset DeduplicateStyles state between different pipeline executions [core] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1298925 (https://phabricator.wikimedia.org/T428336) (owner: 10C. Scott Ananian) [13:40:03] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-codfw [13:40:10] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1298929|Simplify fragment processing (T423700)]], [[gerrit:1298926|Move ::getFragmentsToTransform() to Content{Text,DOM}TransformStage]], [[gerrit:1298927|OutputTransform: Rename DeduplicateStyles and ExpandToAbsoluteUrls stages]], [[gerrit:1298925|Reset DeduplicateStyles state between different pipeline executions (T428336 T428215)]], [[gerrit:1299497| [13:40:10] Store indicators in ContentHolder: forward compatibility (T427622)]] [13:40:17] T423700: Apply OutputTransform stages to other fragment than BODY in ContentHolder - https://phabricator.wikimedia.org/T423700 [13:40:18] T428336: DiscussionTools preview is missing TemplateStyles styling - https://phabricator.wikimedia.org/T428336 [13:40:18] T428215: The CSS-subpages of templates and CSS in Styles tab of Index pages is broken in the Page NS - https://phabricator.wikimedia.org/T428215 [13:40:19] T427622: Process indicators in ContentHolder - https://phabricator.wikimedia.org/T427622 [13:40:50] hm, truncated !log message, T427622 won’t get stashbot comments :( [13:41:08] (03PS1) 10Ayounsi: wmf_netflow: cast numbers to numbers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299507 [13:41:24] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [13:42:11] !log cscott@deploy1003 cscott: Backport for [[gerrit:1298929|Simplify fragment processing (T423700)]], [[gerrit:1298926|Move ::getFragmentsToTransform() to Content{Text,DOM}TransformStage]], [[gerrit:1298927|OutputTransform: Rename DeduplicateStyles and ExpandToAbsoluteUrls stages]], [[gerrit:1298925|Reset DeduplicateStyles state between different pipeline executions (T428336 T428215)]], [[gerrit:1299497|Store indicators [13:42:11] in ContentHolder: forward compatibility (T427622)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:33] PROBLEM - BFD status on lsw1-a7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:44:22] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:45:36] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [13:45:44] (03CR) 10Alex Paskulin: [C:03+1] "Tested locally and works as expected! Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [13:46:07] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [13:47:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2027.codfw.wmnet to cluster codfw and group A [13:47:33] RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:47:57] (03CR) 10Gkyziridis: "Thnx for your comment!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [13:48:10] (03Abandoned) 10Ayounsi: wmf_netflow: cast numbers to numbers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299507 (owner: 10Ayounsi) [13:48:32] (03PS7) 10Arnaudb: mailman: discard stale pending subscription requests [puppet] - 10https://gerrit.wikimedia.org/r/1299448 (https://phabricator.wikimedia.org/T353891) [13:48:32] (03CR) 10Arnaudb: [C:03+2] "the script has been run on production, the next iteration should be faster than 25+ minutes" [puppet] - 10https://gerrit.wikimedia.org/r/1299448 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [13:48:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2027.codfw.wmnet to cluster codfw and group A [13:48:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2045.codfw.wmnet to cluster codfw and group A [13:49:11] (03PS1) 10Arnaudb: mailman: add alerts for discard_stale_subscriptions [alerts] - 10https://gerrit.wikimedia.org/r/1299506 (https://phabricator.wikimedia.org/T353891) [13:49:25] (03PS1) 10Ayounsi: Revert "wmf_netflow: specify data kind (bool/number)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299508 [13:49:48] (03CR) 10Arnaudb: [C:03+2] "new alerts for mailman" [alerts] - 10https://gerrit.wikimedia.org/r/1299506 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [13:50:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2045.codfw.wmnet to cluster codfw and group A [13:50:46] !log cscott@deploy1003 cscott: Continuing with deployment [13:51:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow2004.codfw.wmnet to drbd [13:51:56] (03Merged) 10jenkins-bot: mailman: add alerts for discard_stale_subscriptions [alerts] - 10https://gerrit.wikimedia.org/r/1299506 
(https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [13:52:05] (03CR) 10Ayounsi: [C:03+2] Revert "wmf_netflow: specify data kind (bool/number)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299508 (owner: 10Ayounsi) [13:52:22] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:52:26] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:53:39] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11999794 (10ABran-WMF) 05Open→03Stalled p:05Unbreak!→03Medium purge done: ` real 27m10.367s user 1m40.23... [13:54:01] (03Merged) 10jenkins-bot: Revert "wmf_netflow: specify data kind (bool/number)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299508 (owner: 10Ayounsi) [13:55:01] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298929|Simplify fragment processing (T423700)]], [[gerrit:1298926|Move ::getFragmentsToTransform() to Content{Text,DOM}TransformStage]], [[gerrit:1298927|OutputTransform: Rename DeduplicateStyles and ExpandToAbsoluteUrls stages]], [[gerrit:1298925|Reset DeduplicateStyles state between different pipeline executions (T428336 T428215)]], [[gerrit:1299497 [13:55:01] |Store indicators in ContentHolder: forward compatibility (T427622)]] (duration: 14m 51s) [13:55:07] T423700: Apply OutputTransform stages to other fragment than BODY in ContentHolder - https://phabricator.wikimedia.org/T423700 [13:55:08] T428336: DiscussionTools preview is missing TemplateStyles styling - https://phabricator.wikimedia.org/T428336 [13:55:08] T428215: The CSS-subpages of templates and CSS in Styles tab of Index pages is broken in the Page NS - https://phabricator.wikimedia.org/T428215 [13:55:08] T427622: Process indicators in ContentHolder - 
https://phabricator.wikimedia.org/T427622 [13:55:11] Lucas_WMDE: all done! [13:55:18] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:55:25] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:55:33] cscott: thanks! [13:55:43] oh, new changes appear [13:55:47] Dreamy_Jazz: do you still want to deploy? [13:55:58] Yes thank [13:56:00] jouncebot: nowandnext [13:56:00] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1300) [13:56:01] In 0 hour(s) and 3 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1400) [13:56:01] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [13:56:06] Probably still time [13:56:27] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [13:56:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299495 (https://phabricator.wikimedia.org/T428584) (owner: 10Dreamy Jazz) [13:56:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299502 (https://phabricator.wikimedia.org/T428599) (owner: 10Dreamy Jazz) [13:56:52] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [13:56:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [13:57:24] (03CR) 10Elukey: [C:03+1] "I am fine if your team wants to proceed in this way, but as I wrote before we should strive to avoid special configs if possible. 
So ideal" [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [13:58:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for - https://phabricator.wikimedia.org/T427553#11999807 (10APDube-WMF) Thanks @RLazarus - I just need access to the specific dashboard linked above for now. Will let you know in case an expan... [13:58:29] PROBLEM - BFD status on lsw1-c2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:39] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [13:58:45] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [13:58:58] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:24] (03Merged) 10jenkins-bot: STVFormatter: Cast strings to float before passing to round [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299495 (https://phabricator.wikimedia.org/T428584) (owner: 10Dreamy Jazz) [13:59:26] (03Merged) 10jenkins-bot: SecurePollLogPager: Cast user IDs to ints before use [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299502 (https://phabricator.wikimedia.org/T428599) (owner: 10Dreamy Jazz) [13:59:32] (03PS10) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [13:59:33] (03PS1) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [13:59:35] (03PS1) 10Andrew Bogott: trove: install cumin key in new DB 
instances [puppet] - 10https://gerrit.wikimedia.org/r/1299511 (https://phabricator.wikimedia.org/T422801) [13:59:50] (03PS2) 10Andrew Bogott: trove: install cumin key in new DB instances [puppet] - 10https://gerrit.wikimedia.org/r/1299511 (https://phabricator.wikimedia.org/T422801) [13:59:50] (03PS11) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [13:59:51] (03PS2) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [13:59:55] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1299495|STVFormatter: Cast strings to float before passing to round (T428584)]], [[gerrit:1299502|SecurePollLogPager: Cast user IDs to ints before use (T428599)]] [14:00:01] T428584: TypeError when tallying STV elections - https://phabricator.wikimedia.org/T428584 [14:00:01] T428599: Special:SecurePollLog: TypeError when opening log page - https://phabricator.wikimedia.org/T428599 [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1400) [14:00:08] 06SRE, 10SRE-Access-Requests: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037#11999819 (10Michael) Thank you @Raine, I can confirm that I can use the new key to connect to stat and deploy hosts. So, I think my old public production key (the one a... 
[14:01:08] (03PS1) 10Muehlenhoff: Record LDAP access for sjones-ctr [puppet] - 10https://gerrit.wikimedia.org/r/1299512 [14:01:13] 10SRE-SLO: Sloth dashboard performance improvement - https://phabricator.wikimedia.org/T425564#11999824 (10tappof) 05Open→03Resolved a:03tappof [14:01:17] (03PS4) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 [14:02:06] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1299495|STVFormatter: Cast strings to float before passing to round (T428584)]], [[gerrit:1299502|SecurePollLogPager: Cast user IDs to ints before use (T428599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:29] RECOVERY - BFD status on lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:02:31] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [14:02:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow2004.codfw.wmnet to drbd [14:02:57] 06SRE, 10SRE-Access-Requests: Rotating production SSH-Key for @Michael to a Yubikey-based one - https://phabricator.wikimedia.org/T428037#11999846 (10Raine) >>! In T428037#11999819, @Michael wrote: > Thank you @Raine, I can confirm that I can use the new key to connect to stat and deploy hosts. > > So, I... 
[14:03:10] 10SRE-SLO: Sloth: enable alerting - https://phabricator.wikimedia.org/T428617 (10tappof) 03NEW [14:03:11] (03PS3) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [14:03:36] (03CR) 10Andrew Bogott: [C:03+2] trove: install cumin key in new DB instances [puppet] - 10https://gerrit.wikimedia.org/r/1299511 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [14:03:41] (03CR) 10Ayounsi: [C:03+2] Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 (owner: 10Ayounsi) [14:03:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of rpki2003.codfw.wmnet to drbd [14:04:06] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for sjones-ctr [puppet] - 10https://gerrit.wikimedia.org/r/1299512 (owner: 10Muehlenhoff) [14:05:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:58] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:03] (03Merged) 10jenkins-bot: Sort webrequest_sampled_live dimensions alphabetically [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299415 (owner: 10Ayounsi) [14:06:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2241: rack depool [14:06:30] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [14:06:48] !log 
dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299495|STVFormatter: Cast strings to float before passing to round (T428584)]], [[gerrit:1299502|SecurePollLogPager: Cast user IDs to ints before use (T428599)]] (duration: 06m 53s) [14:06:54] T428584: TypeError when tallying STV elections - https://phabricator.wikimedia.org/T428584 [14:06:55] T428599: Special:SecurePollLog: TypeError when opening log page - https://phabricator.wikimedia.org/T428599 [14:07:02] !log ayounsi@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [14:07:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1299457 (owner: 10Slyngshede) [14:07:23] !log Afternoon UTC backport window done [14:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] (03PS1) 10Tiziano Fogli: sloth: add abstract-wikipedia task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299515 (https://phabricator.wikimedia.org/T428617) [14:10:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host parsoidtest1001.eqiad.wmnet [14:10:58] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:12] (03PS12) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [14:12:12] (03PS4) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [14:12:14] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1184: Migration of db1184.eqiad.wmnet completed [14:12:15] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.mysql.major-upgrade (exit_code=0) [14:13:29] PROBLEM - BFD status on lsw1-d2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:14:40] !log imported routinator 0.15.2-1bookworm to thirdparty/routinator for bookworm-wikimedia T428456 [14:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:45] T428456: Upgrade Routinator to 0.15.2 - https://phabricator.wikimedia.org/T428456 [14:16:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of rpki2003.codfw.wmnet to drbd [14:16:09] PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2153: Migration of db2153.codfw.wmnet completed [14:16:11] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:16:23] RECOVERY - Host rpki2003 is UP: PING WARNING - Packet loss = 33%, RTA = 887.20 ms [14:16:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parsoidtest1001.eqiad.wmnet [14:17:29] RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:17:44] (03PS1) 10Sbisson: ArticleGuidance: restrict beta deployment to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299516 [14:19:12] (03CR) 10Ottomata: stream: mediawiki.user_change (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) (owner: 10JavierMonton) [14:20:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-codfw [14:20:09] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [14:21:23] (03CR) 10Arnaudb: [C:03+1] "instead of creating an erb file, it could be simpler to pass the --path as an argument to the script, which then will be configurable via " [puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [14:23:31] (03CR) 10Hnowlan: [C:03+1] sloth: add abstract-wikipedia task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299515 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [14:25:58] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:01] !log brouberol@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad [14:26:17] !log brouberol@cumin1003 END (ERROR) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=97) rolling reboot on A:cephosd-eqiad [14:29:30] (03CR) 10Elukey: [C:03+1] sloth: add abstract-wikipedia task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299515 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1430) [14:33:29] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-wf2001.codfw.wmnet with OS trixie [14:35:00] !log brouberol@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad [14:37:37] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:38:31] (03CR) 10Santiago Faci: [C:03+2] Deploy GrowthBook 4.4.0 to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1298711 (https://phabricator.wikimedia.org/T427506) (owner: 10Santiago Faci) [14:38:50] 06SRE, 10DNS, 06Traffic: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12000128 (10cmooney) Thanks @Marostegui for the task. There are a few components here that need to work. Firstly our authdns servers are responsible for the entire 10.0.0.0/8 range (10.in-addr.a... [14:38:52] (03PS13) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [14:38:52] (03PS5) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [14:38:52] (03PS1) 10Andrew Bogott: Update default trove instance pubkey, again [puppet] - 10https://gerrit.wikimedia.org/r/1299518 [14:39:17] (03PS1) 10Effie Mouzeli: netbox: switch to rdb1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) [14:39:33] (03CR) 10Tiziano Fogli: [C:03+2] sloth: add abstract-wikipedia task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299515 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [14:39:42] (03PS1) 10JavierMonton: stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299520 (https://phabricator.wikimedia.org/T421237) [14:40:26] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:37] (03Merged) 10jenkins-bot: Deploy GrowthBook 4.4.0 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298711 (https://phabricator.wikimedia.org/T427506) (owner: 10Santiago Faci) [14:40:38] !log upgrade routinator in codfw 
to 0.15.2 T428456 [14:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:42] T428456: Upgrade Routinator to 0.15.2 - https://phabricator.wikimedia.org/T428456 [14:41:27] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12000140 (10jijiki) [14:41:37] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:43:00] (03CR) 10Federico Ceratto: sre.mysql: add local ruff.toml (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [14:43:13] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [14:43:15] (03CR) 10Andrew Bogott: [C:03+2] Update default trove instance pubkey, again [puppet] - 10https://gerrit.wikimedia.org/r/1299518 (owner: 10Andrew Bogott) [14:43:36] (03PS1) 10JavierMonton: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299522 (https://phabricator.wikimedia.org/T421237) [14:44:46] (03CR) 10Federico Ceratto: [C:03+2] "Acknowledged" [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [14:45:09] (03PS4) 10Effie Mouzeli: mediawiki-common: add rdb1015 rdb1016 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) [14:45:23] (03CR) 10Kosta Harlan: "Whenever is convenient for you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) (owner: 10Kosta Harlan) [14:45:34] (03PS5) 10Effie Mouzeli: mediawiki-common: add rdb1015 rdb1016 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) [14:45:43] (03PS1) 10JavierMonton: stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299523 (https://phabricator.wikimedia.org/T421237) [14:46:06] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [14:46:46] (03Merged) 10jenkins-bot: sre.mysql: add local ruff.toml [cookbooks] - 10https://gerrit.wikimedia.org/r/1297100 (owner: 10Federico Ceratto) [14:47:16] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply [14:50:26] (03PS1) 10Tiziano Fogli: sloth: add editing task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299524 (https://phabricator.wikimedia.org/T428617) [14:52:01] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab[2002-2003].codfw.wmnet,phab[1004-1006].eqiad.wmnet with reason: T410849 [14:52:05] T410849: Update to Phorge/Arcanist upstream 2026-06-01 - https://phabricator.wikimedia.org/T410849 [14:52:06] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [14:52:20] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#12000193 (10Krd) Appear to work fine now. Thank you! 
[14:52:37] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 364204320 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:54:56] (03CR) 10Dzahn: "just checking - this makes apache startup rely on /srv/ path existing but for gerrit specifically I think that's ok and our goal to also g" [puppet] - 10https://gerrit.wikimedia.org/r/1298939 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [14:55:18] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1640864 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:55:36] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:18] (03CR) 10Marco Fossati: [C:03+1] "LGTM, ready to backport." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298875 (https://phabricator.wikimedia.org/T423148) (owner: 10Kimberly Sarabia) [14:57:32] (03CR) 10Effie Mouzeli: "@elukey or Arzhel, the host is ready and I tested connectivity, would it be ok if you merge this at your convenience?" [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [14:58:14] (03CR) 10Ayounsi: [C:03+1] "Thanks, feel free to merge it anytime." 
[puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [14:58:17] (03CR) 10Effie Mouzeli: "resolved the comment by accident 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [14:58:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2001.codfw.wmnet with reason: host reimage [14:59:47] (03CR) 10Effie Mouzeli: [C:03+2] "merging!" [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1500). [15:00:08] (03PS2) 10Effie Mouzeli: netbox: switch to rdb1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) [15:01:19] !log brennen@deploy1003 Started deploy [phabricator/deployment@d244a3e]: deploy phab2002 for T410849 [15:01:23] T410849: Update to Phorge/Arcanist upstream 2026-06-01 - https://phabricator.wikimedia.org/T410849 [15:01:25] (03PS5) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php to rdb2015:6381 #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299468 (https://phabricator.wikimedia.org/T418918) [15:01:36] (03CR) 10Effie Mouzeli: [C:03+2] netbox: switch to rdb1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299519 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:02:04] !log brennen@deploy1003 Finished deploy [phabricator/deployment@d244a3e]: deploy phab2002 for T410849 (duration: 00m 45s) [15:02:29] (03CR) 10Slyngshede: [C:03+2] Permissions: Approvals are incorrectly compared [software/bitu] - 10https://gerrit.wikimedia.org/r/1299457 (owner: 10Slyngshede) [15:02:36] !log 
brennen@deploy1003 Started deploy [phabricator/deployment@d244a3e]: deploy phab1004 for T410849 [15:03:18] !log brennen@deploy1003 Finished deploy [phabricator/deployment@d244a3e]: deploy phab1004 for T410849 (duration: 00m 42s) [15:05:04] (03CR) 10BCornwall: [C:03+2] Rewrite VarnishHighThreadCount to trigger less [alerts] - 10https://gerrit.wikimedia.org/r/1298909 (owner: 10BCornwall) [15:06:03] (03PS1) 10Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428294) [15:06:07] (03Merged) 10jenkins-bot: Permissions: Approvals are incorrectly compared [software/bitu] - 10https://gerrit.wikimedia.org/r/1299457 (owner: 10Slyngshede) [15:06:38] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:07:04] (03PS1) 10Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299527 (https://phabricator.wikimedia.org/T428099) [15:07:15] (03CR) 10Clément Goubert: [C:03+1] ProductionServices.php: switch filebackend.php to rdb2015:6381 #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299468 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:07:24] (03CR) 10Brouberol: "`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [15:07:38] (03CR) 10Clément Goubert: [C:03+1] mediawiki-common: add rdb1015 rdb1016 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:07:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:09:04] 
(03CR) 10Effie Mouzeli: "similar to Ib04fa522586ede6f83927abf094c86fc8e305469" [puppet] - 10https://gerrit.wikimedia.org/r/1299467 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:09:16] (03CR) 10Gehel: airflow: export the CLASSPATH environment variable into the task-pod shell (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [15:09:44] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [15:10:38] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:12:05] (03CR) 10Volans: cloud cumin: use ubuntu@ when reaching Trove database instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [15:13:13] (03CR) 10Clément Goubert: [C:03+1] "Might as well so deployment-prep works if it's setup there?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [15:13:20] (03Merged) 10jenkins-bot: mediawiki-common: remove old rdb servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299447 (https://phabricator.wikimedia.org/T428561) (owner: 10Effie Mouzeli) [15:13:41] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: add rdb1015 rdb1016 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:15:21] jouncebot: nowandnext [15:15:21] For the next 0 hour(s) and 44 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1500) [15:15:21] In 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1600) [15:15:22] (03CR) 10Aleksandar Mastilovic: "Isn't `.bash_profile` a more common place for setting variables? This `.bashrc` seems to be configured for interactive shells (which expla" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [15:15:54] would the SREs mind if I deploy https://gerrit.wikimedia.org/r/c/wikibase/termbox/+/1298789 now? 
cc jelto [15:16:23] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2001.codfw.wmnet with OS trixie [15:16:59] (03Merged) 10jenkins-bot: mediawiki-common: add rdb1015 rdb1016 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299470 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:17:40] (03CR) 10Clément Goubert: [C:03+1] thumbor: change readiness probes to make surge recovery safer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298811 (https://phabricator.wikimedia.org/T357145) (owner: 10Hnowlan) [15:18:08] Lucas_WMDE: I do not have context :/ [15:18:22] (03PS2) 10RLazarus: admin: Add apdube to analytics-private-datausers [puppet] - 10https://gerrit.wikimedia.org/r/1298924 (https://phabricator.wikimedia.org/T427553) [15:18:45] effie: if you’re currently deploying, I can also do it later [15:18:57] * Lucas_WMDE doesn’t really know what this window is normally used for anyway [15:19:23] haha, it is an SRE window, you can go after me for certain [15:19:43] (03CR) 10RLazarus: [C:03+2] admin: Add apdube to analytics-private-datausers [puppet] - 10https://gerrit.wikimedia.org/r/1298924 (https://phabricator.wikimedia.org/T427553) (owner: 10RLazarus) [15:21:32] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:21:41] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [15:21:57] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[15:22:07] (03CR) 10Ottomata: [C:03+1] stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299523 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [15:22:35] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299522 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [15:22:45] !log Remove `migrateMentorStatusAwayToCommunityConfiguration` from updatelog on all wikis (T409170; the script was only ever run as a dry-run) [15:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:49] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [15:22:50] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299520 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [15:22:56] (03PS1) 10Atsuko: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299529 (https://phabricator.wikimedia.org/T425377) [15:23:19] thanks! 
(currently waiting for the image to build) [15:24:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299468 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:24:32] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:57] (03Merged) 10jenkins-bot: ProductionServices.php: switch filebackend.php to rdb2015:6381 #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299468 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [15:25:25] !log jiji@deploy1003 Started scap sync-world: Backport for [[gerrit:1299468|ProductionServices.php: switch filebackend.php to rdb2015:6381 #2 (T418918 T291916)]] [15:25:32] T418918: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918 [15:25:33] T291916: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 [15:26:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for  - https://phabricator.wikimedia.org/T427553#12000465 (10RLazarus) 05Open→03Resolved a:03RLazarus Done! Please wait up to 30 minutes for that to propagate to all servers, then you... [15:26:48] !log jiji@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [15:27:11] !log jiji@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [15:27:29] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-wf2002.codfw.wmnet with OS trixie [15:28:56] !log jiji@deploy1003 Rolling back deployment [15:29:09] sigh Lucas_WMDE I need to fix something [15:30:36] np, still building [15:30:47] best of luck! 
[15:30:53] (03PS1) 10Effie Mouzeli: mediawiki-common: fix rdb server IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299530 [15:31:08] (03PS1) 10Elukey: WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) [15:31:52] (03CR) 10Blake: [C:03+1] mediawiki-common: fix rdb server IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299530 (owner: 10Effie Mouzeli) [15:32:03] (03CR) 10Kamila Součková: [C:03+1] mediawiki-common: fix rdb server IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299530 (owner: 10Effie Mouzeli) [15:32:43] !log brennen@deploy1003 Started deploy [phabricator/deployment@73e57ce]: deploy phab2002 for T410849 (followup for robots.txt) [15:32:46] !log jiji@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299468|ProductionServices.php: switch filebackend.php to rdb2015:6381 #2 (T418918 T291916)]] (duration: 07m 21s) [15:32:48] T410849: Update to Phorge/Arcanist upstream 2026-06-01 - https://phabricator.wikimedia.org/T410849 [15:32:53] T418918: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918 [15:32:54] T291916: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 [15:33:29] !log brennen@deploy1003 Finished deploy [phabricator/deployment@73e57ce]: deploy phab2002 for T410849 (followup for robots.txt) (duration: 00m 45s) [15:33:47] !log brennen@deploy1003 Started deploy [phabricator/deployment@73e57ce]: deploy phab1004 for T410849 (followup for robots.txt) [15:33:49] (please refrain from running scap, it will fail anyway:) [15:34:13] oh dear CI is torture at this point [15:34:27] !log brennen@deploy1003 Finished deploy [phabricator/deployment@73e57ce]: deploy phab1004 for T410849 (followup for robots.txt) (duration: 00m 40s) [15:34:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - 
https://phabricator.wikimedia.org/T418928#12000523 (10Dwisehaupt) [15:35:00] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: fix rdb server IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299530 (owner: 10Effie Mouzeli) [15:35:00] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) (owner: 10Elukey) [15:35:31] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:35:38] (03CR) 10Slyngshede: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1297125 (https://phabricator.wikimedia.org/T426809) (owner: 10Tiziano Fogli) [15:35:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#12000537 (10Dwisehaupt) 05Open→03Resolved Host built and testing in progress for trixie. Going to close this out and any new issues can be opened in new tasks. 
[15:36:18] (03CR) 10Elukey: [C:03+1] sloth: add editing task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299524 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [15:38:18] (03Merged) 10jenkins-bot: mediawiki-common: fix rdb server IP list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299530 (owner: 10Effie Mouzeli) [15:38:31] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:40:47] (03PS1) 10Lucas Werkmeister (WMDE): termbox: update to 2026-06-09-152845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299533 (https://phabricator.wikimedia.org/T321316) [15:43:47] !log brouberol@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-eqiad [15:45:28] !log jiji@deploy1003 Started scap sync-world: redeploy 1299468 [15:46:03] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage [15:46:42] !log jiji@deploy1003 jiji: redeploy 1299468 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:47:07] a scap without a spiderpig?? 
🤯 [15:47:13] !log jiji@deploy1003 jiji: Continuing with deployment [15:47:34] Lucas_WMDE: I wish it were different [15:48:12] oh dear, https://spiderpig.wikimedia.org/jobs/2222 looks scary /o\ [15:48:35] hahaha it was just a typo in the allow list of some new IPs [15:49:15] but yes all it looks absolutely scary [15:49:29] (03PS2) 10Elukey: WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) [15:49:56] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage [15:50:12] (03PS1) 10Cathal Mooney: rancid: add "show version" to commands for SR Linux switches [puppet] - 10https://gerrit.wikimedia.org/r/1299537 [15:50:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [15:51:31] (03CR) 10CI reject: [V:04-1] WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) (owner: 10Elukey) [15:51:34] !log jiji@deploy1003 Finished scap sync-world: redeploy 1299468 (duration: 07m 23s) [15:51:50] * effie takes a breath [15:51:58] Lucas_WMDE: all yours [15:52:12] :o [15:52:13] thanks! [15:52:34] do I need a +1/+2 by someone else on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1299533 (version bump)? or is it okay if I self-merge those? 
I’m never sure [15:53:07] I can +1 [15:53:10] thanks <3 [15:53:14] (03CR) 10Effie Mouzeli: [C:03+1] termbox: update to 2026-06-09-152845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299533 (https://phabricator.wikimedia.org/T321316) (owner: 10Lucas Werkmeister (WMDE)) [15:53:21] I don’t think there’s much to review other than “i deploy it and roll back if it breaks”) [15:53:22] (03PS3) 10Elukey: WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) [15:53:32] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] termbox: update to 2026-06-09-152845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299533 (https://phabricator.wikimedia.org/T321316) (owner: 10Lucas Werkmeister (WMDE)) [15:53:48] !log jiji@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:54:27] Lucas_WMDE: ever since spiderpig refused to let me read logs, i opted to SSH too :D [15:54:30] !log jiji@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:54:55] wait what [15:55:11] Lucas_WMDE: it crashed when i opened the job, because there was A LOT OF LOGS [15:55:21] ./o\ [15:55:24] (03PS4) 10Elukey: WIP - docker_registry: introduce migration backends in Nginx [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) [15:55:51] (03Merged) 10jenkins-bot: termbox: update to 2026-06-09-152845-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299533 (https://phabricator.wikimedia.org/T321316) (owner: 10Lucas Werkmeister (WMDE)) [15:56:01] (it wasn't a pleasant deployment experience) [15:56:06] I _think_ it got fixed, but... 
[15:56:26] Lucas_WMDE: T424975 if you're curious :) [15:56:26] T424975: Certain deployment logs cause Spiderpig to crash the browser - https://phabricator.wikimedia.org/T424975 [15:56:34] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/termbox: apply [15:57:04] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:57:14] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [15:57:21] “Pass --context 5 to helmfile apply” aaaahaha yeah I see how *not* doing that could cause problems [15:57:42] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299531 (https://phabricator.wikimedia.org/T428022) (owner: 10Elukey) [15:57:48] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [15:57:59] (I thought T422162 might have helped in the meantime by making the console more lightweight but that was actually before that task, so nevermind) [15:57:59] T422162: Cannot copy out of SpiderPig console - https://phabricator.wikimedia.org/T422162 [15:58:15] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/termbox: apply [15:58:17] yeah, my browser just full-crashed whenever i tried opening the console [15:58:58] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:59:04] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:59:05] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [15:59:07] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Setting log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299485 (owner: 10Santiago Faci) [15:59:29] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:59:43] !log jiji@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. 
[15:59:50] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1600). [16:00:05] Msz2001, revi, and Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] \o [16:00:12] o/ [16:00:17] \o/ [16:00:23] looks like I finished deploying just in time \o/ [16:00:27] !log jiji@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [16:00:54] o/ [16:01:04] * revi worried if he should reschedule lol [16:01:19] (03Merged) 10jenkins-bot: Test Kitchen UI: Setting log level to debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299485 (owner: 10Santiago Faci) [16:01:35] effie: I’m all done, thank you! [16:01:49] \m/ [16:02:24] !log jiji@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [16:02:34] and just realized `s/worried/was worried` :-p [16:02:37] !log jiji@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [16:02:57] (03PS2) 10Cathal Mooney: rancid: add "show version" to commands for SR Linux switches [puppet] - 10https://gerrit.wikimedia.org/r/1299537 [16:04:31] (03CR) 10Tiziano Fogli: [C:03+2] liberica: disable pint check for missing metrics [alerts] - 10https://gerrit.wikimedia.org/r/1297125 (https://phabricator.wikimedia.org/T426809) (owner: 10Tiziano Fogli) [16:04:55] revi: so your patch is the same destination, but hits the english title? 
[16:05:01] yup [16:05:02] (03CR) 10Tiziano Fogli: [C:03+2] sloth: add editing task receiver [puppet] - 10https://gerrit.wikimedia.org/r/1299524 (https://phabricator.wikimedia.org/T428617) (owner: 10Tiziano Fogli) [16:05:16] and then the english title will become the canonical name for that page [16:05:38] (as of now, eng redirects to kor, the plan is to make kor redirect to eng) [16:06:32] nod, okay thanks, merging all three patches in [16:06:37] (03CR) 10JHathaway: [C:03+2] Periodic jobs: add demote_ineligible_users (and _central_ counterpart) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) (owner: 10Mszwarc) [16:06:47] (03CR) 10JHathaway: [C:03+2] Change kr.wikimedia destination [puppet] - 10https://gerrit.wikimedia.org/r/1298381 (https://phabricator.wikimedia.org/T428327) (owner: 10Revi) [16:06:51] (03PS2) 10Filippo Giunchedi: icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) [16:06:51] (03PS1) 10Filippo Giunchedi: etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) [16:06:53] (03PS1) 10Filippo Giunchedi: toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) [16:06:56] (03PS1) 10Filippo Giunchedi: Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) [16:07:01] (03CR) 10JHathaway: [C:03+2] alertmanager: Reroute TSP alerts to PSI alerts channel [puppet] - 10https://gerrit.wikimedia.org/r/1299420 (owner: 10Dreamy Jazz) [16:07:39] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2002.codfw.wmnet with OS trixie [16:07:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:30] Msz2001, revi, and Dreamy_Jazz, changes are merged in [16:08:32] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:37] Thanks [16:08:45] thx [16:09:08] thanks [16:09:33] !log jiji@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [16:10:10] !log jiji@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [16:11:11] (03PS1) 10Atsuko: ElasticSearchTtmServer: drop include_type_name and support int replicas [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299556 (https://phabricator.wikimedia.org/T428168) [16:12:04] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-wf1001.eqiad.wmnet with OS trixie [16:13:07] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:13:11] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:14:44] (03CR) 10Effie Mouzeli: "similar to Icc89f4aef6c021318acd4072ae72efa9058eb451" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:14:56] (03CR) 10Effie Mouzeli: "similar to Icc89f4aef6c021318acd4072ae72efa9058eb451" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:15:44] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:16:07] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:16:17] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1299520 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:16:20] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299522 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:16:22] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299523 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:16:31] (03PS1) 10Atsuko: ElasticSearchTtmServer: clean stale _doc usage and version error output [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299561 (https://phabricator.wikimedia.org/T428168) [16:18:00] (03CR) 10Blake: [C:03+1] changeprop-jobqueue: switch to rdb1015 #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:18:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299556 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [16:18:30] (03CR) 10Blake: [C:03+1] changeprop: switch to rdb1015 #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:18:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299556 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [16:19:11] (03CR) 10Effie Mouzeli: [C:03+2] changeprop: switch to rdb1015 #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 
(https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:19:33] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:19:36] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299520 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:19:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:19:51] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299522 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:19:54] (03Merged) 10jenkins-bot: stream: mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299523 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:19:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Translate] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1299561 (https://phabricator.wikimedia.org/T428168) (owner: 10Atsuko) [16:20:15] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:20:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299529 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [16:21:46] (03Merged) 10jenkins-bot: changeprop: switch to rdb1015 #4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299460 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:22:41] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:23:16] !log jiji@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/changeprop: apply [16:25:27] (03CR) 10Effie Mouzeli: [C:03+2] changeprop-jobqueue: switch to rdb1015 #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:26:05] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [16:27:37] (03PS15) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [16:28:23] (03Merged) 10jenkins-bot: changeprop-jobqueue: switch to rdb1015 #5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299462 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:28:37] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:29:11] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:30:24] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1001.eqiad.wmnet with reason: host reimage [16:30:39] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:31:15] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:31:17] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:31:19] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:31:23] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [16:33:08] (03PS3) 10Bvibber: Bump opcache memory limits for LCStoreStaticArray [puppet] - 10https://gerrit.wikimedia.org/r/1281779 (https://phabricator.wikimedia.org/T99740) [16:33:23] (03CR) 10Dzahn: [C:03+2] gerrit: adjust path to gc_log [puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [16:33:29] (03PS2) 10Dzahn: 
gerrit: adjust path to gc_log [puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) [16:33:33] (03CR) 10Clément Goubert: [C:03+1] redioscope: switch to rdb1015 #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299459 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:33:44] (03CR) 10Clément Goubert: [C:03+1] ratelimit: switch to rdb1015 #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299464 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:33:48] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: switch to rdb1015 #8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299466 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:34:31] (03CR) 10Ayounsi: [C:03+1] "worth a try!" [puppet] - 10https://gerrit.wikimedia.org/r/1299537 (owner: 10Cathal Mooney) [16:34:38] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:43] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS trixie [16:35:33] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host kafka-main2008 [16:36:46] (03CR) 10Dzahn: [C:03+2] gerrit: adjust path to gc_log [puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [16:36:47] (03CR) 10Effie Mouzeli: [C:03+2] ratelimit: switch to rdb1015 #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299464 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:37:21] (03CR) 10Dzahn: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1298932 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [16:38:12] (03CR) 10Dzahn: [C:03+2] gerrit: ensure 
error_log.json, sshd_log.json are always shipped to ELK [puppet] - 10https://gerrit.wikimedia.org/r/1298931 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [16:38:36] jasmine@cumin2002 reimage (PID 1468399) is awaiting input [16:39:06] (03Merged) 10jenkins-bot: ratelimit: switch to rdb1015 #6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299464 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:39:38] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:39:48] (03PS2) 10Dzahn: gerrit: flip direction of symlink for log directories [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) [16:41:12] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [16:41:17] (03CR) 10Xcollazo: stream: mediawiki.user_change (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) (owner: 10JavierMonton) [16:41:35] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [16:42:33] (03CR) 10Effie Mouzeli: [C:03+2] redioscope: switch to rdb1015 #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299459 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:42:57] (03PS1) 10Jasmine: hieradata/common.yaml: add new IPs for kafka-main2008, following vlan migrations [puppet] - 10https://gerrit.wikimedia.org/r/1299570 (https://phabricator.wikimedia.org/T427088) [16:44:16] (03CR) 10Cathal Mooney: [C:03+2] rancid: add "show version" to commands for SR Linux switches [puppet] - 10https://gerrit.wikimedia.org/r/1299537 (owner: 10Cathal Mooney) [16:45:00] (03Merged) 10jenkins-bot: redioscope: 
switch to rdb1015 #7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299459 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:45:21] (03CR) 10JMeybohm: [C:03+1] hieradata/common.yaml: add new IPs for kafka-main2008, following vlan migrations [puppet] - 10https://gerrit.wikimedia.org/r/1299570 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:45:40] (03PS1) 10Santiago Faci: Test Kitchen UI: Restore log level to default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299573 [16:47:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:47:25] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: apply [16:47:43] !log jiji@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: apply [16:47:55] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1001.eqiad.wmnet with OS trixie [16:48:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:48:56] PROBLEM - Ensure traffic_manager is running for instance backend on cp4041 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:49:19] (03PS1) 10Jasmine: Revert^2 "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1299574 [16:49:32] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.4.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299575 (https://phabricator.wikimedia.org/T427976) [16:49:56] RECOVERY - Ensure traffic_manager is running for instance backend on cp4041 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:50:22] !log 
blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc-wf1002.eqiad.wmnet with OS trixie [16:51:02] (03CR) 10JMeybohm: [C:03+1] Revert^2 "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1299574 (owner: 10Jasmine) [16:52:03] (03CR) 10Effie Mouzeli: [C:03+2] rest-gateway: switch to rdb1015 #8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299466 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:52:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:52:42] (03CR) 10CI reject: [V:04-1] Revert^2 "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1299574 (owner: 10Jasmine) [16:53:19] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [16:53:51] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.4.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299575 (https://phabricator.wikimedia.org/T427976) (owner: 10Santiago Faci) [16:54:31] (03PS2) 10Jasmine: Revert^2 "kafka-main2008: apply host-level override [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1299574 [16:55:45] (03CR) 10Jasmine: [C:03+2] Revert^2 "kafka-main2008: apply host-level override [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1299574 (owner: 10Jasmine) [16:55:55] (03CR) 10Jasmine: [C:03+2] hieradata/common.yaml: add new IPs for kafka-main2008, following vlan migrations [puppet] - 10https://gerrit.wikimedia.org/r/1299570 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:55:59] (03CR) 10JavierMonton: [C:03+2] "This is merged but it is a bit late (for me) to deploy it and ensure everything works well. I'll do the deployment tomorrow morning." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1299523 (https://phabricator.wikimedia.org/T421237) (owner: 10JavierMonton) [16:56:23] (03CR) 10Clément Goubert: "These changes would only apply to beta, we don't have appservers anymore in production but keep the code around for it." [puppet] - 10https://gerrit.wikimedia.org/r/1281779 (https://phabricator.wikimedia.org/T99740) (owner: 10Bvibber) [16:56:25] (03Merged) 10jenkins-bot: rest-gateway: switch to rdb1015 #8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299466 (https://phabricator.wikimedia.org/T418918) (owner: 10Effie Mouzeli) [16:56:34] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299575 (https://phabricator.wikimedia.org/T427976) (owner: 10Santiago Faci) [16:57:02] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:57:06] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:57:11] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [16:58:29] 06SRE, 10ServiceOps-Upgrades-Hardware, 13Patch-For-Review, 06ServiceOps new (Next quarter): rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12001194 (10jijiki) [16:58:40] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:58:56] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1700) [17:00:15] (03CR) 10Clément Goubert: "Realizing I forgot to point out where that's configured now:" [puppet] - 10https://gerrit.wikimedia.org/r/1281779 (https://phabricator.wikimedia.org/T99740) (owner: 10Bvibber) [17:03:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS trixie [17:04:07] !log jasmine@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2008 - jasmine@cumin2002" [17:04:14] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host cp5018 [17:04:16] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main2008 - jasmine@cumin2002" [17:04:16] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:16] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache kafka-main2008.codfw.wmnet 4.32.192.10.in-addr.arpa 4.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:04:19] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [17:04:20] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-main2008.codfw.wmnet 4.32.192.10.in-addr.arpa 4.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:04:21] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main2008 [17:04:52] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [17:05:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main2008 [17:05:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-main2008 [17:06:49] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage [17:07:18] brett@cumin2002 reimage (PID 1474637) is awaiting input [17:07:23] (03PS1) 10BCornwall: common: Update cp5018's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1299579 (https://phabricator.wikimedia.org/T428229) [17:07:32] FIRING: [2x] CoreRouterInterfaceDown: Core 
router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:10:39] (03PS14) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [17:10:39] (03PS6) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) [17:10:41] (03CR) 10Andrew Bogott: cloud cumin: use ubuntu@ when reaching Trove database instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1299510 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:14:13] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage [17:17:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:21:15] (03CR) 10BCornwall: [C:03+1] Add wikikube-ctrl2004 and wikikube-ctrl2005 to codfw K8S NS entries [dns] - 10https://gerrit.wikimedia.org/r/1299440 (owner: 10Cathal Mooney) [17:32:12] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1002.eqiad.wmnet with OS trixie [17:34:57] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Datastores: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. 
- https://phabricator.wikimedia.org/T428654 (10JMeybohm) 03NEW [17:35:35] (03CR) 10Catrope: [C:03+1] Add 2FA enforcement demotion config for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298890 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [17:43:38] !log kafka-main2008 is down due to hardware failure T428654 [17:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:42] T428654: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. - https://phabricator.wikimedia.org/T428654 [17:43:55] (03CR) 10Ottomata: stream: mediawiki.user_change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299422 (https://phabricator.wikimedia.org/T423952) (owner: 10JavierMonton) [17:45:07] (03CR) 10Ssingh: [V:03+1 C:03+2] hcaptcha: Allow the wikisource.org bare domain in frame-ancestors CSP [puppet] - 10https://gerrit.wikimedia.org/r/1299427 (https://phabricator.wikimedia.org/T428539) (owner: 10Kosta Harlan) [17:46:04] !log sudo cumin 'A:hcaptcha-proxy' 'run-puppet-agent': rolling out CR 1299427 T428539 [17:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:09] T428539: Cannot publish pages on https://wikisource.org due to hcaptcha Content Security Policy - https://phabricator.wikimedia.org/T428539 [17:46:28] (03CR) 10Ssingh: "Nice!" [alerts] - 10https://gerrit.wikimedia.org/r/1298909 (owner: 10BCornwall) [17:46:48] (03CR) 10Cathal Mooney: [C:03+2] Add wikikube-ctrl2004 and wikikube-ctrl2005 to codfw K8S NS entries [dns] - 10https://gerrit.wikimedia.org/r/1299440 (owner: 10Cathal Mooney) [17:46:59] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Datastores: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. 
- https://phabricator.wikimedia.org/T428654#12001460 (10RobH) p:05Triage→03High [17:47:05] !log cmooney@dns2005 START - running authdns-update [17:47:45] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: reimage [17:48:09] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Datastores: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. - https://phabricator.wikimedia.org/T428654#12001488 (10RobH) [17:48:26] !log cmooney@dns2005 END - running authdns-update [17:48:51] !log https://releases.wikimedia.org | https://releases-jenkins.wikimedia.org - down for maintenance T418299 [17:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:55] T418299: upgrade releases hosts to trixie - https://phabricator.wikimedia.org/T418299 [17:50:33] (03CR) 10Ssingh: "Yeah it's totally fine. If thumb.wikimedia.org will have other records with (such as the donate MX), use DYNA. If not, use a CNAME to dyna" [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) (owner: 10Ladsgroup) [17:52:18] dzahn@cumin2002 reimage (PID 1482983) is awaiting input [17:52:36] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#12001497 (10ssingh) Update: Based on the discussion in the Traffic meeting, we would like to pursue the option of getting the additional CPU in July 2026. CC @wiki_willy and @RobH -- please let us kno... [17:53:50] (03PS4) 10Bvibber: Bump opcache memory limits for LCStoreStaticArray [puppet] - 10https://gerrit.wikimedia.org/r/1281779 (https://phabricator.wikimedia.org/T428655) [17:54:49] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#12001508 (10RobH) >>! 
In T414411#12001497, @ssingh wrote: > Update: Based on the discussion in the Traffic meeting, we would like to pursue the option of getting the additional CPU in July 2026. CC @w... [18:00:04] dduvall and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T1800). [18:02:07] 06SRE, 10DNS, 06Traffic: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12001514 (10cmooney) Seems the IP is in use for //mediawiki-dumps-legacy//: ` cmooney@dse-k8s-ctrl1001:~$ sudo kubectl get pods -o wide --all-namespaces | grep 10.67.28.73 mediawiki-dumps-legacy... [18:04:16] (03PS1) 10Bvibber: Bump opcache.interned_strings_buffer for LCStoreStaticArray [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299583 (https://phabricator.wikimedia.org/T428655) [18:07:32] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:38] o/ [18:11:22] (03CR) 10Brouberol: "`.bash_profile` is for login shells, whereas this was solely setup for users exec-ing into the `task-shell` pod, to be able to interact wi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428294) (owner: 10Brouberol) [18:12:10] (03PS2) 10Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) [18:12:12] (03CR) 10Brouberol: airflow: export the CLASSPATH environment variable into the task-pod shell (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) 
[18:12:33] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1299579 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [18:14:00] (03PS1) 10Dzahn: releases: remove outdated comments about releases-jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1299585 (https://phabricator.wikimedia.org/T330960) [18:15:16] (03CR) 10Brouberol: "If you want this env var to be setup for all pods using the airflow image, I think we should define it in the image entrypoint itself." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [18:15:42] (03CR) 10Dzahn: "Hey Jaime, basically just confirming that the claim "releases-jenkins does not work in codfw yet" is not true. It seems like it can't be b" [puppet] - 10https://gerrit.wikimedia.org/r/1299585 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [18:16:02] (03CR) 10Aleksandar Mastilovic: "Well not necessarily all pods, but at least the ones running tasks executing Python code? 
The reason we went to the shell was because an `" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1299525 (https://phabricator.wikimedia.org/T428099) (owner: 10Brouberol) [18:16:25] (03CR) 10BCornwall: [C:03+1] wmf-config: Update private subnets to include additions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [18:23:36] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299587 (https://phabricator.wikimedia.org/T423915) [18:23:39] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299587 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [18:24:35] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299587 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [18:25:35] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a8 [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299588 (https://phabricator.wikimedia.org/T378906) [18:26:10] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2008.codfw.wmnet with OS trixie [18:26:18] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a8 [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299589 (https://phabricator.wikimedia.org/T428270) [18:26:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299589 (https://phabricator.wikimedia.org/T428270) (owner: 10C. 
Scott Ananian) [18:27:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299588 (https://phabricator.wikimedia.org/T378906) (owner: 10C. Scott Ananian) [18:28:40] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Datastores: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. - https://phabricator.wikimedia.org/T428654#12001627 (10Jhancock.wm) unplugged the PSUs and drained the flea power. had it PXE boot at Jayme's request. Reboot fixed it, but if... [18:29:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#12001628 (10Jclark-ctr) a:05Dwisehaupt→03Jclark-ctr [18:29:23] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS trixie [18:30:59] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.6 refs T423915 [18:31:04] T423915: 1.47.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T423915 [18:31:06] (03PS2) 10Dzahn: releases: remove outdated comments about releases-jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1299585 (https://phabricator.wikimedia.org/T330960) [18:34:01] (03PS1) 10Dzahn: releases: switch active backend from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1299590 (https://phabricator.wikimedia.org/T330960) [18:35:50] (03PS1) 10Dzahn: swich releases.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1299591 (https://phabricator.wikimedia.org/T330960) [18:42:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:42:22] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.24.0-a8 [core] (wmf/1.47.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1299589 (https://phabricator.wikimedia.org/T428270) (owner: 10C. Scott Ananian) [18:42:33] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#12001682 (10RKemper) a:03RKemper [18:42:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:43:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:43:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:46:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:46:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:52] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:50:36] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:50:57] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:51:05] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [18:51:46] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:51:55] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[18:51:57] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [18:52:22] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:52:30] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:52:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-wdqs2003 to codfw - jhancock@cumin2002" [18:52:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-wdqs2003 to codfw - jhancock@cumin2002" [18:52:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:19] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:53:26] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [18:53:55] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [18:54:03] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [18:54:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:54:53] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [18:55:01] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [18:55:49] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [18:55:58] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:56:27] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:56:34] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[18:57:10] 06SRE, 10Wikimedia-Mailing-lists: Create mail list for Wikimedia Community User Group Cyprus - https://phabricator.wikimedia.org/T428525#12001731 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I created it as a public mailing list, if you wanted a private one, you can change the settings yourself: htt... [18:57:24] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [18:57:31] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:58:01] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:58:08] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [18:58:41] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [18:58:59] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [18:59:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#12001745 (10Jclark-ctr) 05Open→03Resolved If error returns by friday please reopen ticket [19:00:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-wdqs2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:00:23] (03PS1) 10Jdlrobson: Revert "Create VectorComponentPageToolbar component" [skins/Vector] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299602 (https://phabricator.wikimedia.org/T428649) [19:03:09] (03PS1) 10Ryan Kemper: wdqs: decom wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299606 (https://phabricator.wikimedia.org/T427852) [19:03:40] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299606 (https://phabricator.wikimedia.org/T427852) (owner: 10Ryan Kemper) [19:11:34] (03PS2) 10Ryan Kemper: wdqs: decom wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299606 (https://phabricator.wikimedia.org/T428582) [19:12:03] 
RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:12:22] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [19:12:44] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [19:13:18] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Datastores: kafka-main2008 CPU 2 MEMABCD VPP PG voltage is outside of range. - https://phabricator.wikimedia.org/T428654#12001800 (10JMeybohm) 05Open→03Resolved a:03JMeybohm >>! In T428654#12001627, @Jhancock.wm wrote: > unplugged th... [19:13:49] (03CR) 10Ryan Kemper: [C:03+2] wdqs: decom wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/1299606 (https://phabricator.wikimedia.org/T428582) (owner: 10Ryan Kemper) [19:15:07] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts wdqs1015.eqiad.wmnet [19:15:50] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS trixie [19:20:26] (03PS1) 10Bartosz Dziewoński: Add my public key to mediawiki.org/keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 [19:20:37] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [19:26:06] (03CR) 10Bartosz Dziewoński: "* Is it okay that my key expires in about 2 months? Should I create a new one for this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (owner: 10Bartosz Dziewoński) [19:26:37] ryankemper@cumin2002 decommission (PID 1502974) is awaiting input [19:27:53] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [19:28:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [19:28:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wdqs1015.eqiad.wmnet [19:35:29] (03PS2) 10Bartosz Dziewoński: Add my public key to mediawiki.org/keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299614 (https://phabricator.wikimedia.org/T423267) [19:40:15] (03CR) 10BCornwall: [C:03+1] swich releases.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1299591 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [19:43:02] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): decommission wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T428582#12001937 (10Jclark-ctr) 05In progress→03Resolved [19:45:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [19:46:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:53:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:54:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:59:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T2000). [20:00:05] apaskulin and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] hi! [20:04:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:06:56] o/ [20:07:15] i don't know who the official deployer for this window is, but i can spiderpig deploy [20:07:46] apaskulin can you test your config change if I deploy it for you? 
[20:07:59] yes! thank you that would be great [20:08:12] ok let's get started then [20:08:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [20:08:33] it only affects test wiki, so it's low risk [20:09:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:09:44] (03Merged) 10jenkins-bot: wgRestSandboxSpecs: Add lift-wing spec pointing to api.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299454 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [20:10:12] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1299454|wgRestSandboxSpecs: Add lift-wing spec pointing to api.wikimedia.org (T427902)]] [20:10:16] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [20:12:15] !log cscott@deploy1003 cscott, gkyziridis: Backport for [[gerrit:1299454|wgRestSandboxSpecs: Add lift-wing spec pointing to api.wikimedia.org (T427902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:39] apaskulin: ok, ready to test [20:12:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:13:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs2001 [20:13:38] _testing_ [20:13:43] 10ops-ulsfo, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678 (10ayounsi) 03NEW p:05Triage→03High [20:13:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs2001 [20:13:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs2002 [20:14:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs2002 [20:14:03] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs2003 [20:14:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs2003 [20:14:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-wdqs2004 [20:14:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-wdqs2004 [20:15:35] 10ops-ulsfo, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002145 (10RobH) a:03RobH [20:16:13] hmmm I'm not seeing it. 
I would expect to see Lift Wing API in the dropdown on https://test.wikipedia.org/w/index.php?api=mw-extra&title=Special%3ARestSandbox [20:16:30] the page itself seems to still be working normally [20:16:42] do you have the x-wikimedia-debug extension turned on (i keep forgetting to enable it when testing) [20:17:38] Just turned it on, just need to look up how to access its output [20:17:44] jhancock@cumin2002 provision (PID 1515434) is awaiting input [20:18:17] ok! Lift Wing just appeared in the dropdown [20:18:34] it's throwing a CORS error, which we'll need to look into [20:18:41] https://test.wikipedia.org/w/index.php?api=lift-wing&title=Special%3ARestSandbox [20:18:51] i see it i think https://usercontent.irccloud-cdn.com/file/2N9PhSQU/image.png [20:19:09] same! [20:19:31] apaskulin: take more time if you need to investigate the cors issue, i'm not in a hurry. [20:23:51] 10ops-ulsfo, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002168 (10RobH) IRC summary: * icinga doesn't alert for this, netops to handle action item to fix icinga alerting on nokia (not working) versus juniper (works) * netbox doesn't have the power cord ids or their landing... [20:24:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:25:46] I'm looking into it, but I'm not sure what's going on. 
Even though it's erroring, it doesn't seem to be impacting any other functionality, so if you think it's ok, I think my preference would be to leave it as is to better troubleshoot when others are back online tomorrow [20:27:58] ok, i'll finish rolling out the deploy. we can always roll back later if we need to. [20:28:02] !log cscott@deploy1003 cscott, gkyziridis: Continuing with deployment [20:28:04] thanks! [20:28:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-wdqs2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:32:20] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299454|wgRestSandboxSpecs: Add lift-wing spec pointing to api.wikimedia.org (T427902)]] (duration: 22m 08s) [20:32:25] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [20:32:50] it's working now for me https://test.wikipedia.org/w/index.php?api=lift-wing&title=Special%3ARestSandbox huh [20:33:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:25] i was a little suspicious that the CORS issues might resolve itself once it was fully in production and not just the test servers, but i don't understand CORS well enough to know if/why. 
[20:33:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:33] oh weird, it was Wikimedia Debug that was causing it to error [20:33:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299588 (https://phabricator.wikimedia.org/T378906) (owner: 10C. Scott Ananian) [20:34:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299589 (https://phabricator.wikimedia.org/T428270) (owner: 10C. Scott Ananian) [20:34:01] CORS is mysterious [20:36:12] (03PS1) 10Neriah: Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) [20:37:01] (03CR) 10Neriah: [C:04-1] "wait..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [20:37:50] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a8 [vendor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299588 (https://phabricator.wikimedia.org/T378906) (owner: 10C. Scott Ananian) [20:38:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:44] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a8 [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299589 (https://phabricator.wikimedia.org/T428270) (owner: 10C. 
Scott Ananian) [20:39:14] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1299588|Bump wikimedia/parsoid to 0.24.0-a8 (T378906 T420336 T424427 T427664 T427972 T428452 T428270)]], [[gerrit:1299589|Bump wikimedia/parsoid to 0.24.0-a8 (T428270)]] [20:39:33] T378906: Categories as link tags cause navboxes to have a rendering difference - https://phabricator.wikimedia.org/T378906 [20:39:34] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [20:39:34] T424427: Rich Attributes Phase 3 Alternative 3c: implicit schema via attribute registry - https://phabricator.wikimedia.org/T424427 [20:39:35] T427664: pwrapping skips td cells of tables that are embedded in dllists - https://phabricator.wikimedia.org/T427664 [20:39:35] T427972: Edge case in processing templated extlink with an entity-encoded ] char in the url - https://phabricator.wikimedia.org/T427972 [20:39:36] T428452: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T428452 [20:39:36] T428270: CTT tasks week of 2026-05-06 - https://phabricator.wikimedia.org/T428270 [20:41:15] !log cscott@deploy1003 cscott: Backport for [[gerrit:1299588|Bump wikimedia/parsoid to 0.24.0-a8 (T378906 T420336 T424427 T427664 T427972 T428452 T428270)]], [[gerrit:1299589|Bump wikimedia/parsoid to 0.24.0-a8 (T428270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:42:07] (03PS8) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:42:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:57] 10ops-ulsfo, 06SRE, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002285 (10RobH) 01175946 created with digital realty: > Support, > > One of our access switches in our racks has recently shown its PSU1 as not receiving power. We would like to use remote hands to both tro... [20:43:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-wdqs2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:43:05] (03PS1) 10Ryan Kemper: query_service: fix and prune allowlist URLs [puppet] - 10https://gerrit.wikimedia.org/r/1299627 (https://phabricator.wikimedia.org/T419205) [20:43:08] (03PS1) 10Ryan Kemper: query_service: add https Finto allowlist entry [puppet] - 10https://gerrit.wikimedia.org/r/1299628 (https://phabricator.wikimedia.org/T420702) [20:43:11] (03PS1) 10Ryan Kemper: query_service: add QLever OSM allowlist entry [puppet] - 10https://gerrit.wikimedia.org/r/1299629 (https://phabricator.wikimedia.org/T420705) [20:43:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2002.codfw.wmnet with OS trixie [20:43:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12002307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhanco... 
[20:45:25] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299627 (https://phabricator.wikimedia.org/T419205) (owner: 10Ryan Kemper) [20:45:35] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299628 (https://phabricator.wikimedia.org/T420702) (owner: 10Ryan Kemper) [20:45:39] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1299629 (https://phabricator.wikimedia.org/T420705) (owner: 10Ryan Kemper) [20:46:13] !log cscott@deploy1003 cscott: Continuing with deployment [20:50:16] jhancock@cumin2002 reimage (PID 1521351) is awaiting input [20:50:28] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299588|Bump wikimedia/parsoid to 0.24.0-a8 (T378906 T420336 T424427 T427664 T427972 T428452 T428270)]], [[gerrit:1299589|Bump wikimedia/parsoid to 0.24.0-a8 (T428270)]] (duration: 11m 13s) [20:50:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2002.codfw.wmnet with OS trixie [20:50:42] T378906: Categories as link tags cause navboxes to have a rendering difference - https://phabricator.wikimedia.org/T378906 [20:50:43] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [20:50:44] T424427: Rich Attributes Phase 3 Alternative 3c: implicit schema via attribute registry - https://phabricator.wikimedia.org/T424427 [20:50:46] T427664: pwrapping skips td cells of tables that are embedded in dllists - https://phabricator.wikimedia.org/T427664 [20:50:47] T427972: Edge case in processing templated extlink with an entity-encoded ] char in the url - https://phabricator.wikimedia.org/T427972 [20:50:47] T428452: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T428452 [20:50:48] T428270: CTT tasks week of 2026-05-06 - https://phabricator.wikimedia.org/T428270 [20:51:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 
06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - https://phabricator.wikimedia.org/T423312#12002443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@c... [20:51:15] (03CR) 10Ryan Kemper: [C:03+2] query_service: fix and prune allowlist URLs [puppet] - 10https://gerrit.wikimedia.org/r/1299627 (https://phabricator.wikimedia.org/T419205) (owner: 10Ryan Kemper) [20:51:18] (03CR) 10Ryan Kemper: [C:03+2] query_service: add https Finto allowlist entry [puppet] - 10https://gerrit.wikimedia.org/r/1299628 (https://phabricator.wikimedia.org/T420702) (owner: 10Ryan Kemper) [20:51:21] (03CR) 10Ryan Kemper: [C:03+2] query_service: add QLever OSM allowlist entry [puppet] - 10https://gerrit.wikimedia.org/r/1299629 (https://phabricator.wikimedia.org/T420705) (owner: 10Ryan Kemper) [20:52:14] (03PS1) 10Jasmine: service::catalog:sophroid: switch http probe to tcp [puppet] - 10https://gerrit.wikimedia.org/r/1299631 (https://phabricator.wikimedia.org/T428133) [20:52:33] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:24] 10ops-ulsfo, 06SRE, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002478 (10cmooney) 05Open→03Resolved And like that it was fixed! ` A:cmooney@asw1-22-ulsfo# show platform power-supply +--------------+----+-------------+-------------------+----------------------+--... 
[20:56:35] (03CR) 10RLazarus: [C:03+1] service::catalog:sophroid: switch http probe to tcp [puppet] - 10https://gerrit.wikimedia.org/r/1299631 (https://phabricator.wikimedia.org/T428133) (owner: 10Jasmine) [20:57:36] (03PS2) 10Jasmine: service::catalog: switch sophroid http probe to tcp [puppet] - 10https://gerrit.wikimedia.org/r/1299631 (https://phabricator.wikimedia.org/T428133) [21:00:02] preparing to do a security deploy, is anything in progress? [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260609T2100) [21:00:26] any folks from the Readers team planning to deploy? [21:03:02] (03CR) 10Jasmine: [C:03+2] service::catalog: switch sophroid http probe to tcp [puppet] - 10https://gerrit.wikimedia.org/r/1299631 (https://phabricator.wikimedia.org/T428133) (owner: 10Jasmine) [21:06:14] 10ops-ulsfo, 06SRE, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002495 (10RobH) > I reseated PSU1 on asw1-22-ulsfo and it now shows a green LED. > > Both asw1-22-ulsfo and asw1-23-ulsfo are connected to port 22 of the PDU's. The cables are not labeled." [21:06:18] 06SRE, 10DNS, 06Traffic: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12002497 (10cmooney) @CDanis not sure if you have any thoughts here? I think because this is a job and not a service endpoint there is no DNS created. And from a bit of brief reading it doesn't... [21:06:45] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:07:04] maryum:yes [21:07:15] i have one train blocker i need to get backported [21:07:19] how long do you need? [21:07:27] just 10 more minutes, is that okay? 
[21:07:30] sure [21:07:33] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:07:38] thanks jdlrobson [21:09:16] 10ops-ulsfo, 06SRE, 06DC-Ops: asw1-22-ulsfo:PSU1 down - https://phabricator.wikimedia.org/T428678#12002501 (10RobH) Not worth the urgent remote hands rate to apply labels, updated netbox with the label temp entry so we have all the info. https://netbox.wikimedia.org/dcim/devices/6642/power-ports/ https:... [21:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:10:33] (03PS1) 10Cathal Mooney: Nokia SR-Linux: get specific component status with gnmic [puppet] - 10https://gerrit.wikimedia.org/r/1299634 (https://phabricator.wikimedia.org/T428685) [21:14:37] (03PS2) 10Cathal Mooney: Nokia SR-Linux: get specific component status with gnmic [puppet] - 10https://gerrit.wikimedia.org/r/1299634 (https://phabricator.wikimedia.org/T428685) [21:15:15] maryum: ping me when you are done [21:15:53] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:17:12] running the first scap on wmf.5 [21:24:01] running second and final scap wmf.6 [21:24:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [21:26:56] jdlrobson scap is finished, thank you! [21:27:04] !log Deployed security fix for T428324 [21:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:26] maryum: thanks! 
[21:28:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299602 (https://phabricator.wikimedia.org/T428649) (owner: 10Jdlrobson) [21:29:22] (03PS1) 10Jasmine: service::catalog: update sophroid tcp probe to tcp-notls [puppet] - 10https://gerrit.wikimedia.org/r/1299638 (https://phabricator.wikimedia.org/T428133) [21:29:41] (03CR) 10Dzahn: [C:03+2] gerrit: move httpd logs to $site_path/logs [puppet] - 10https://gerrit.wikimedia.org/r/1298939 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [21:29:48] (03PS2) 10Dzahn: gerrit: move httpd logs to $site_path/logs [puppet] - 10https://gerrit.wikimedia.org/r/1298939 (https://phabricator.wikimedia.org/T425667) [21:32:19] (03CR) 10RLazarus: [C:03+1] service::catalog: update sophroid tcp probe to tcp-notls [puppet] - 10https://gerrit.wikimedia.org/r/1299638 (https://phabricator.wikimedia.org/T428133) (owner: 10Jasmine) [21:32:23] (03CR) 10Dzahn: [C:03+2] gerrit: move httpd logs to $site_path/logs [puppet] - 10https://gerrit.wikimedia.org/r/1298939 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [21:32:47] (03CR) 10Jasmine: [C:03+2] service::catalog: update sophroid tcp probe to tcp-notls [puppet] - 10https://gerrit.wikimedia.org/r/1299638 (https://phabricator.wikimedia.org/T428133) (owner: 10Jasmine) [21:33:17] (03PS1) 10Jdlrobson: [Bug] Donor Badge: Remove client prefs for control group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299639 (https://phabricator.wikimedia.org/T428501) [21:34:52] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: debug [21:35:12] (03CR) 10Dzahn: [C:03+2] "AH02291: Cannot access directory '/logs/' for error log of vhost" [puppet] - 10https://gerrit.wikimedia.org/r/1298939 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) 
[21:36:01] Jdlrobson: i've got a heading fix to backport (T428677) once you're done [21:36:02] T428677: Something is wrong with the rendering of headings on this page - https://phabricator.wikimedia.org/T428677 [21:36:44] (03PS1) 10C. Scott Ananian: HandleSectionLinks: add temporary fallback to identify html headings [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299640 (https://phabricator.wikimedia.org/T428677) [21:37:51] cscott: i have one more after this one [21:38:04] no worries just ping me when you're done [21:40:00] cscott: will do! [21:41:03] (03Merged) 10jenkins-bot: Revert "Create VectorComponentPageToolbar component" [skins/Vector] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299602 (https://phabricator.wikimedia.org/T428649) (owner: 10Jdlrobson) [21:41:29] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1299602|Revert "Create VectorComponentPageToolbar component" (T428649)]] [21:41:35] T428649: Duplicate p-tb (tools/toolbox) menu in Vector 2022 - https://phabricator.wikimedia.org/T428649 [21:43:28] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1299602|Revert "Create VectorComponentPageToolbar component" (T428649)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:44:17] (03PS1) 10Dzahn: gerrit: fix variable name for site path used in httpd template [puppet] - 10https://gerrit.wikimedia.org/r/1299642 (https://phabricator.wikimedia.org/T425667) [21:44:20] (03PS1) 10Reedy: wmf-config: Add $wmgOATHAuthRequire2FAForAll config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299643 (https://phabricator.wikimedia.org/T420792) [21:44:23] (03PS1) 10Reedy: Set $wmgOATHAuthRequire2FAForAll = true for various private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) [21:44:26] (03PS1) 10Reedy: Set $wmgOATHAuthRequire2FAForAll = true for all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299645 (https://phabricator.wikimedia.org/T428103) [21:45:21] (03CR) 10CI reject: [V:04-1] wmf-config: Add $wmgOATHAuthRequire2FAForAll config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299643 (https://phabricator.wikimedia.org/T420792) (owner: 10Reedy) [21:45:24] (03CR) 10CI reject: [V:04-1] Set $wmgOATHAuthRequire2FAForAll = true for various private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [21:45:30] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [21:45:31] (03CR) 10CI reject: [V:04-1] Set $wmgOATHAuthRequire2FAForAll = true for all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299645 (https://phabricator.wikimedia.org/T428103) (owner: 10Reedy) [21:46:46] (03PS2) 10Reedy: wmf-config: Add $wmgOATHAuthRequire2FAForAll config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299643 (https://phabricator.wikimedia.org/T420792) [21:46:46] (03PS2) 10Reedy: Set $wmgOATHAuthRequire2FAForAll = true for various private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299644 (https://phabricator.wikimedia.org/T428103) [21:46:46] (03PS2) 10Reedy: Set $wmgOATHAuthRequire2FAForAll = true for all private wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299645 (https://phabricator.wikimedia.org/T428103) [21:47:29] (03PS1) 10Cathal Mooney: Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) [21:48:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [21:48:40] (03PS2) 10Cathal Mooney: Validators - add check to make sure dns_name is unique [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1299646 (https://phabricator.wikimedia.org/T428546) [21:49:45] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299602|Revert "Create VectorComponentPageToolbar component" (T428649)]] (duration: 08m 16s) [21:49:50] T428649: [regression] Duplicate p-tb (tools/toolbox) menu in Vector 2022 caused by watchstar/bookmark icon changes - https://phabricator.wikimedia.org/T428649 [21:50:16] (03CR) 10CDobbins: "PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [21:51:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299639 (https://phabricator.wikimedia.org/T428501) (owner: 10Jdlrobson) [21:52:15] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1299642/8685/gerrit2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1299642 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [21:52:16] !log T428241 removed retired wdqs2009 full-graph journal dump (446G x2, ~892G) from clouddumps100[1-2]:/srv/dumps/xmldatadumps/public/other/wdqs [21:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:21] T428241: Clean up of WDQS full graph endpoint (wdqs2009) Blazegraph journal file - https://phabricator.wikimedia.org/T428241 [21:54:22] 
(03Merged) 10jenkins-bot: [Bug] Donor Badge: Remove client prefs for control group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299639 (https://phabricator.wikimedia.org/T428501) (owner: 10Jdlrobson) [21:54:50] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1299639|[Bug] Donor Badge: Remove client prefs for control group (T428501)]] [21:54:55] T428501: [Bug] Donor Badge: Remove client prefs for control group - https://phabricator.wikimedia.org/T428501 [21:55:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:56:54] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1299639|[Bug] Donor Badge: Remove client prefs for control group (T428501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:56:57] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:09] (03PS16) 10JHathaway: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:00:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.292s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:00:17] (03CR) 10CI reject: [V:04-1] redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:00:18] (03CR) 10Dzahn: "wait one more day or so - releng is likely ok with it but will discuss in tomorrow's meeting" [dns] - 10https://gerrit.wikimedia.org/r/1298744 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [22:00:37] (03PS2) 10Ladsgroup: wikimedia.org: Introduce thumb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) [22:00:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 944.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:01:50] (03CR) 10Ladsgroup: "ah, makes sense since that's why. I simply switched to CNAME of dyna then. is it correct?" [dns] - 10https://gerrit.wikimedia.org/r/1298821 (https://phabricator.wikimedia.org/T427465) (owner: 10Ladsgroup) [22:01:57] RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:09] (03PS17) 10JHathaway: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:06:04] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2003.wikimedia.org with reason: debug [22:07:08] !log gerrit - apache httpd log file location moved to /srv/gerrit/site_path/review_site/logs/ T425667 [22:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:13] T425667: Investigate Gerrit root disk usage and logging - https://phabricator.wikimedia.org/T425667 [22:07:33] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:46] (03PS3) 10Dzahn: gerrit: flip direction of symlink for log directories [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) [22:08:55] (03CR) 10Dzahn: "httpd has been reconfigured to use the new location - but it still links back to the old location - until this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/1298938 
(https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [22:09:28] (03CR) 10Dzahn: "/srv/gerrit/site_path/review_site/logs: symbolic link to /var/log/gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [22:10:24] (03CR) 10Dzahn: "wait, ignore that last comment. this is just for gerrit logs, not about /var/log/apache2 where the httpd logs were before. the point is th" [puppet] - 10https://gerrit.wikimedia.org/r/1298938 (https://phabricator.wikimedia.org/T425667) (owner: 10Dzahn) [22:11:27] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [22:12:28] (03CR) 10JHathaway: "@ltoscano@wikimedia.org I modified the implementation a bit, to try and simplify it, I was able to test it successfully against, cloudvirt" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [22:13:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:15:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.047s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:47] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299639|[Bug] Donor Badge: Remove client prefs for control group (T428501)]] (duration: 20m 57s) [22:15:51] T428501: [Bug] Donor Badge: Remove client prefs for control 
group - https://phabricator.wikimedia.org/T428501 [22:16:15] cscott: done sorry for delay [22:16:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:18:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:27:39] 07sre-alert-triage, 06ServiceOps new, 13Patch-For-Review: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133#12002756 (10jasmine_) Resolving as probes now succeed with probe type `tcp-notls` 🎉 . Much thanks for investigating this in more depth @RLazarus, @... [22:28:11] Jdlrobson: no worries, thanks! [22:28:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299640 (https://phabricator.wikimedia.org/T428677) (owner: 10C. Scott Ananian) [22:33:35] (03CR) 10Dzahn: [C:03+1] trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [22:40:47] (03Merged) 10jenkins-bot: HandleSectionLinks: add temporary fallback to identify html headings [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1299640 (https://phabricator.wikimedia.org/T428677) (owner: 10C. 
Scott Ananian) [22:41:14] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1299640|HandleSectionLinks: add temporary fallback to identify html headings (T428677)]] [22:41:19] T428677: Something is wrong with the rendering of headings on this page - https://phabricator.wikimedia.org/T428677 [22:43:15] !log cscott@deploy1003 cscott: Backport for [[gerrit:1299640|HandleSectionLinks: add temporary fallback to identify html headings (T428677)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:43:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [22:43:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... 
[22:43:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:45:55] !log cscott@deploy1003 cscott: Continuing with deployment [22:47:23] 07sre-alert-triage, 06ServiceOps new: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133#12002823 (10jasmine_) 05Open→03Resolved [22:50:13] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299640|HandleSectionLinks: add temporary fallback to identify html headings (T428677)]] (duration: 08m 59s) [22:50:18] T428677: Something is wrong with the rendering of headings on this page - https://phabricator.wikimedia.org/T428677 [22:50:23] ok done [22:56:46] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 832.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:09:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 810.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:14:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 810.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 935ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:24:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 812.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:37:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 806.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:39:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1299653 [23:39:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1299653 (owner: 10TrainBranchBot) [23:42:45] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 806.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:51:11] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1299653 (owner: 10TrainBranchBot) [23:57:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:57:33] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:59:38] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown