[00:14:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:42:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:43:40] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[00:44:25] (PS2) CDanis: cli: add --sort-groups and --reverse-sort options [software/cumin] - https://gerrit.wikimedia.org/r/1294990
[00:45:26] (CR) CDanis: cli: add --sort-groups and --reverse-sort options (9 comments) [software/cumin] - https://gerrit.wikimedia.org/r/1294990 (owner: CDanis)
[01:00:03] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1295788 (owner: TrainBranchBot)
[01:09:07] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.5 [core] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296056 (https://phabricator.wikimedia.org/T423914)
[01:09:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.5 [core] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296056 (https://phabricator.wikimedia.org/T423914) (owner: TrainBranchBot)
[01:09:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296057
[01:09:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296057 (owner: TrainBranchBot)
[01:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:10:18] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[01:16:15] (CR) C. Scott Ananian: [C:+1] Deploy PRV to 5 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) (owner: Arlolra)
[01:21:04] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.5 [core] (wmf/1.47.0-wmf.5) - https://gerrit.wikimedia.org/r/1296056 (https://phabricator.wikimedia.org/T423914) (owner: TrainBranchBot)
[01:22:36] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1296057 (owner: TrainBranchBot)
[01:32:05] (PS1) RLazarus: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1296063
[01:32:05] (PS1) RLazarus: Copy mesh.configuration 1.15.2 -> 1.15.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1296064
[01:32:06] (PS1) RLazarus: mesh.networkpolicy: Handle a services_proxy entry with no upstream.ips [deployment-charts] - https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863)
[01:32:07] (PS1) RLazarus: Copy mesh.service 1.2.0 -> 1.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1296066
[01:32:07] (PS1) RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863)
[01:32:12] (PS1) RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863)
[01:32:16] (PS1) RLazarus: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863)
[01:32:20] (PS1) RLazarus: orchestrator: Add restricted_listeners ports to network egress policy [deployment-charts] - https://gerrit.wikimedia.org/r/1296070 (https://phabricator.wikimedia.org/T427863)
[01:32:24] (PS1) RLazarus: wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863)
[01:32:28] (PS1) RLazarus: function-evaluator: Add outgoing Envoy config and egress policy for callbacks [deployment-charts] - https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863)
[01:33:34] (PS1) RLazarus: services_proxy: "Reserve" local port 6520 for wikifunctions orchestrator [puppet] - https://gerrit.wikimedia.org/r/1296073 (https://phabricator.wikimedia.org/T427863)
[01:38:03] (CR) RLazarus: [C:+2] services_proxy: "Reserve" local port 6520 for wikifunctions orchestrator [puppet] - https://gerrit.wikimedia.org/r/1296073 (https://phabricator.wikimedia.org/T427863) (owner: RLazarus)
[01:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:03] (CR) Krinkle: P:cache:haproxy add image generator information (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede)
[01:47:52] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:48:40] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0200)
[02:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:26:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:28:34] (PS2) RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863)
[02:28:34] (PS2) RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863)
[02:28:34] (PS2) RLazarus: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863)
[02:28:35] (PS2) RLazarus: orchestrator: Add restricted_listeners ports to network egress policy [deployment-charts] - https://gerrit.wikimedia.org/r/1296070 (https://phabricator.wikimedia.org/T427863)
[02:28:36] (PS2) RLazarus: wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863)
[02:28:37] (PS2) RLazarus: function-evaluator: Add outgoing Envoy config and egress policy for callbacks [deployment-charts] - https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863)
[02:29:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:29:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:31:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:33:07] (CR) RLazarus: "See https://gerrit.wikimedia.org/r/1296072 for why this is needed." [deployment-charts] - https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: RLazarus)
[02:33:09] (CR) RLazarus: mesh.configuration: Add restricted_listeners (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: RLazarus)
[02:34:19] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:35:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:37:19] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0300)
[03:14:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0400)
[04:05:40] !log mwpresync@deploy1003 Pruned MediaWiki: 1.47.0-wmf.2 (duration: 05m 33s)
[04:13:56] (CR) Ryan Kemper: [C:+2] dse-k8s-codfw: Add wdqs namespaces for the new deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1295465 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[04:22:05] (Merged) jenkins-bot: dse-k8s-codfw: Add wdqs namespaces for the new deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1295465 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[04:35:13] SRE, Traffic, Patch-For-Review: Move contact info detection at the edge to a lua module - https://phabricator.wikimedia.org/T414300#11974879 (Joe) Open→Resolved
[04:36:26] !log ryankemper@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[04:40:43] !log ryankemper@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[04:46:39] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[04:49:51] !log T425007 (k8s) created 4 wdqs namespaces on `dse-k8s-codfw`'s `admin_ng` ns: `wdqs-[internal,external]` & `wdqs-[internal,external]-next`; certs issued
[04:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:54] T425007: Helm chart for wdqs-qlever and wdqs-streaming-consumer - https://phabricator.wikimedia.org/T425007
[04:56:13] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[04:59:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:32] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:02:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:05:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:06:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1052: Upgrading es1052.eqiad.wmnet
[05:06:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1052: Upgrading es1052.eqiad.wmnet
[05:07:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1052.eqiad.wmnet with OS trixie
[05:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:10:18] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[05:14:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:04] (PS14) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007)
[05:21:01] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:22:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1052.eqiad.wmnet with reason: host reimage
[05:25:12] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:26:13] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:28:49] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:29:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1052.eqiad.wmnet with reason: host reimage
[05:29:44] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:30:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:33:20] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:36:40] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1052.eqiad.wmnet with OS trixie
[05:46:22] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply
[05:47:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:47:21] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply
[05:48:44] marostegui@cumin1003 major-upgrade (PID 3861657) is awaiting input
[05:50:47] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[05:50:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1052: repool after upgrade
[05:51:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T426088
[05:51:42] T426088: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T426088
[05:51:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1236 with weight 0 T426088', diff saved to https://phabricator.wikimedia.org/P93470 and previous config saved to /var/cache/conftool/dbconfig/20260602-055153-marostegui.json
[05:52:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:52:17] (PS2) Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1286416 (https://phabricator.wikimedia.org/T426088)
[05:53:07] (PS1) Marostegui: wmnet: Update s7 CNAME [dns] - https://gerrit.wikimedia.org/r/1296248 (https://phabricator.wikimedia.org/T426088)
[05:53:17] (Abandoned) Marostegui: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1286417 (https://phabricator.wikimedia.org/T426088) (owner: Gerrit maintenance bot)
[05:54:36] (CR) Marostegui: [C:+2] mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1286416 (https://phabricator.wikimedia.org/T426088) (owner: Gerrit maintenance bot)
[06:00:02] !log Starting s7 eqiad failover from db1181 to db1236 - T426088
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0600)
[06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0600).
[06:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:06] T426088: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T426088
[06:00:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T426088', diff saved to https://phabricator.wikimedia.org/P93471 and previous config saved to /var/cache/conftool/dbconfig/20260602-060018-marostegui.json
[06:00:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1236 to s7 primary and set section read-write T426088', diff saved to https://phabricator.wikimedia.org/P93472 and previous config saved to /var/cache/conftool/dbconfig/20260602-060041-marostegui.json
[06:01:11] (CR) Marostegui: [C:+2] wmnet: Update s7 CNAME [dns] - https://gerrit.wikimedia.org/r/1296248 (https://phabricator.wikimedia.org/T426088) (owner: Marostegui)
[06:01:23] !log marostegui@dns1004 START - running authdns-update
[06:01:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1181 T426088', diff saved to https://phabricator.wikimedia.org/P93473 and previous config saved to /var/cache/conftool/dbconfig/20260602-060157-marostegui.json
[06:02:49] !log marostegui@dns1004 END - running authdns-update
[06:04:09] (PS1) Marostegui: db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1296249 (https://phabricator.wikimedia.org/T425388)
[06:04:41] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[06:04:43] (CR) Marostegui: [C:+2] db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1296249 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[06:04:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:05:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1181: Upgrading db1181.eqiad.wmnet
[06:05:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1181: Upgrading db1181.eqiad.wmnet
[06:06:41] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[06:07:26] (PS1) Jelto: miscweb: update wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1296250 (https://phabricator.wikimedia.org/T414405)
[06:08:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1181.eqiad.wmnet with OS trixie
[06:10:24] (CR) Jelto: [C:+2] miscweb: update wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1296250 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[06:12:46] (Merged) jenkins-bot: miscweb: update wmf-navigator images [deployment-charts] - https://gerrit.wikimedia.org/r/1296250 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[06:15:46] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[06:16:21] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[06:21:20] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[06:22:12] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[06:24:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage
[06:29:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: host reimage
[06:30:30] (PS1) Muehlenhoff: profile::firewall: Allow to provide more fine-grained access from monitoring [puppet] - https://gerrit.wikimedia.org/r/1296251
[06:36:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1052: repool after upgrade
[06:36:45] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:36:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[06:36:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:37:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet
[06:37:23] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 90%, RTA = 739.19 ms
[06:37:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2045.codfw.wmnet
[06:38:15] (CR) Arnaudb: [C:+1] "question inline, looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: Dzahn)
[06:40:36] jmm@cumin2002 drain-node (PID 3495911) is awaiting input
[06:41:38] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:41:56] (PS1) Marostegui: Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1296252
[06:41:57] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:41:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2053: Upgrading es2053.codfw.wmnet
[06:42:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2053: Upgrading es2053.codfw.wmnet
[06:42:59] (CR) Marostegui: [C:+2] Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1296252 (owner: Marostegui)
[06:43:21] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2053.codfw.wmnet with OS trixie
[06:46:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1181.eqiad.wmnet with OS trixie
[06:50:48] (CR) DCausse: [C:+1] translate: adding separate read/write endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[06:55:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[06:55:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1181: Migration of db1181.eqiad.wmnet completed
[06:55:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93478 and previous config saved to /var/cache/conftool/dbconfig/20260602-065533-fceratto.json
[06:57:05] (CR) Marostegui: "Just a typo and a question, can you run this PCC for also the following hosts, just in case:" [puppet] - https://gerrit.wikimedia.org/r/1296251 (owner: Muehlenhoff)
[06:59:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2053.codfw.wmnet with reason: host reimage
[07:00:05] Amir1, urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T0700).
[07:00:05] atsukoito: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:27] hi!
[07:01:33] o/
[07:02:37] (PS2) Muehlenhoff: profile::firewall: Allow to provide more fine-grained access from monitoring [puppet] - https://gerrit.wikimedia.org/r/1296251
[07:02:47] (CR) Muehlenhoff: [C:+2] Apply cluster::management role to cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1289272 (owner: Muehlenhoff)
[07:04:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2053.codfw.wmnet with reason: host reimage
[07:05:10] (CR) Slyngshede: [C:+1] admin: upgrade Mahmoud Abdelsattar from ldap_only to shell user [puppet] - https://gerrit.wikimedia.org/r/1295952 (https://phabricator.wikimedia.org/T427597) (owner: Dzahn)
[07:12:20] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2241: Depool for rack maintenance
[07:12:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: upgrade
[07:12:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2241: Depool for rack maintenance
[07:12:54] (CR) Dpogorzelski: [C:+1] ml-services: Bump llm ns memory quota to 256Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295958 (owner: Bartosz Wójtowicz)
[07:13:04] (CR) Dpogorzelski: [C:+2] ml-services: Bump llm ns memory quota to 256Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295958 (owner: Bartosz Wójtowicz)
[07:14:47] !log Install mariadb 10.11.17 on db2186 T427345
[07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:50] T427345: Compile and package MariaDB 10.11.17 - https://phabricator.wikimedia.org/T427345
[07:15:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2241.codfw.wmnet with reason: Depool for rack maintenance
[07:16:04] (PS1) Muehlenhoff: mariadb::wmf_root_client: Add cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1296255
[07:16:15] (PS2) Muehlenhoff: mariadb::wmf_root_client: Add cumin2003 [puppet] - https://gerrit.wikimedia.org/r/1296255
[07:16:24] dcausse: i'm ready to backport 1294949: translate: adding separate read/write endpoints | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1294949
[07:16:54] atsukoito: sounds good
[07:17:45] (CR) TrainBranchBot: [C:+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[07:18:38] (Merged) jenkins-bot: translate: adding separate read/write endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/1294949 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[07:19:21] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1294949|translate: adding separate read/write endpoints (T425377)]]
[07:19:25] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377
[07:19:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2053.codfw.wmnet with OS trixie
[07:21:04] (Merged) jenkins-bot: ml-services: Bump llm ns memory quota to 256Gi. [deployment-charts] - https://gerrit.wikimedia.org/r/1295958 (owner: Bartosz Wójtowicz)
[07:21:17] dcausse: after the debug is done, i'll run mwscript https://phabricator.wikimedia.org/T425377#11915906 to test the config before proceeding
[07:21:17] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1294949|translate: adding separate read/write endpoints (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:22:12] atsukoito: ok, will test some special pages in the meantime
[07:23:06] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2241.codfw.wmnet
[07:23:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2241.codfw.wmnet
[07:23:39] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2241: Depool for rack maintenance
[07:24:12] marostegui@cumin1003 major-upgrade (PID 3876624) is awaiting input
[07:25:27] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11975162 (ayounsi)
[07:25:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool pc2021: rack A3 maintenance
[07:25:50] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache
[07:26:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:26:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc2021: rack A3 maintenance
[07:26:48] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11975168 (ayounsi)
[07:26:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on pc2021.codfw.wmnet with reason: rack A3 maintenance
[07:27:39] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[07:28:02] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1296251 (owner: Muehlenhoff)
[07:28:33] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2158: rack A3 maintenance
[07:28:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2158: rack A3 maintenance
[07:28:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T426633)', diff saved to https://phabricator.wikimedia.org/P93487 and previous config saved to /var/cache/conftool/dbconfig/20260602-072856-fceratto.json
[07:29:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2158.codfw.wmnet with reason: rack A3 maintenance
[07:30:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2053: repool after upgrade
[07:32:37] !log pfw1-eqiad# delete protocols bgp group Production family inet6 - T423384
[07:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:41] T423384: Investigate internal rejected prefixes - https://phabricator.wikimedia.org/T423384
[07:36:44] dcausse and me decided on reverting 1294949, won't proceed with promoting testing to prod
[07:39:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P93490 and previous config saved to /var/cache/conftool/dbconfig/20260602-073904-fceratto.json
[07:39:34] !log atsuko@deploy1003 atsuko: Rolling back deployment
[07:40:23] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294949|translate: adding separate read/write endpoints (T425377)]] (duration: 21m
01s) [07:40:26] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:40:42] (03CR) 10Marostegui: [C:03+1] "We also need to add the database grants for this host." [puppet] - 10https://gerrit.wikimedia.org/r/1296255 (owner: 10Muehlenhoff) [07:41:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1181: Migration of db1181.eqiad.wmnet completed [07:41:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [07:41:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1180: Pooling [07:42:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1181.eqiad.wmnet with reason: Reboot [07:43:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1181: Reboot [07:43:14] (03CR) 10Marostegui: [C:03+1] "https://phabricator.wikimedia.org/T427884" [puppet] - 10https://gerrit.wikimedia.org/r/1296255 (owner: 10Muehlenhoff) [07:43:34] (03CR) 10Muehlenhoff: "Yes, that's for followup later. Initially we first need to get all packages properly installed on Trixie-compatible versions etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/1296255 (owner: 10Muehlenhoff) [07:43:36] (03CR) 10Muehlenhoff: [C:03+2] mariadb::wmf_root_client: Add cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296255 (owner: 10Muehlenhoff) [07:44:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1181: Reboot [07:44:28] (03PS1) 10Atsuko: translate: fixing missed variable in credentials formatting closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) [07:45:19] (03CR) 10DCausse: [C:03+1] translate: fixing missed variable in credentials formatting closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:45:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:45:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:47:19] dcausse: applying forward fix, 1296262: translate: fixing missed variable in credentials formatting closure | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1296262 [07:47:30] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1181: Pooling [07:47:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:48:27] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1181: 
Pooling [07:48:46] (03Merged) 10jenkins-bot: translate: fixing missed variable in credentials formatting closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296262 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:49:02] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1296262|translate: fixing missed variable in credentials formatting closure (T425377)]] [07:49:05] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:50:46] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1296262|translate: fixing missed variable in credentials formatting closure (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:54:00] (03PS1) 10Muehlenhoff: Add dbbackups profile for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296371 [07:54:11] (03PS2) 10Muehlenhoff: Add dbbackups profile for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296371 [07:57:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1180: Pooling [07:57:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:57:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T419635)', diff saved to https://phabricator.wikimedia.org/P93498 and previous config saved to /var/cache/conftool/dbconfig/20260602-075759-fceratto.json [07:58:03] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:58:49] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:59:39] !log atsuko@deploy1003 atsuko: Rolling back deployment [07:59:47] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:59:52] rolling back [08:00:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T419635)', diff saved to https://phabricator.wikimedia.org/P93499 and previous config saved to /var/cache/conftool/dbconfig/20260602-080011-fceratto.json [08:02:06] (03PS1) 10Atsuko: Revert "translate: fixing missed variable in credentials formatting closure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296488 [08:03:31] (03PS20) 10Ayounsi: Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [08:03:42] (03PS1) 10Atsuko: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296489 [08:03:49] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296262|translate: fixing missed variable in credentials formatting closure (T425377)]] (duration: 14m 47s) [08:03:53] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [08:07:35] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [08:08:44] (03PS2) 10Atsuko: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296488 (https://phabricator.wikimedia.org/T425377) [08:09:09] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2241: Depool for rack maintenance [08:09:11] (03Abandoned) 10Atsuko: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296489 (owner: 10Atsuko) [08:09:34] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:10:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P93502 and previous config saved to /var/cache/conftool/dbconfig/20260602-081018-fceratto.json [08:10:29] !log Install mariadb 10.11.17 on es2053 T427345 [08:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:32] T427345: Compile and package MariaDB 10.11.17 - https://phabricator.wikimedia.org/T427345 [08:11:17] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:11:59] (03CR) 10DCausse: [C:03+1] Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296488 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:12:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296488 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:12:29] backporting revert [08:13:15] (03Merged) 10jenkins-bot: Revert "translate: adding separate read/write endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296488 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [08:13:30] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1296488|Revert "translate: adding separate read/write endpoints" (T425377)]] [08:13:34] (03CR) 10Jcrespo: [C:04-1] Add dbbackups profile for cumin2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [08:13:34] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [08:13:58] (03PS3) 10Muehlenhoff: sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 
(https://phabricator.wikimedia.org/T248872) [08:15:15] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1296488|Revert "translate: adding separate read/write endpoints" (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:16:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2053: repool after upgrade [08:16:33] !log atsuko@deploy1003 atsuko: Rolling back deployment [08:17:03] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296488|Revert "translate: adding separate read/write endpoints" (T425377)]] (duration: 03m 33s) [08:17:13] rolling back deployment is a normal operation, we didn't want to leave the testing hanging [08:17:26] that concludes the backport window, thanks dcausse [08:17:29] (03CR) 10Slyngshede: [C:03+2] P:idp webauthn, with database backend [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [08:17:35] atsukoito: thanks! [08:17:38] (03PS3) 10Muehlenhoff: Add dbbackups profile for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296371 [08:18:26] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:18:27] (03CR) 10Muehlenhoff: Add dbbackups profile for cumin2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [08:18:43] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:19:40] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:19:40] (03CR) 10Muehlenhoff: sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: 10Muehlenhoff) [08:20:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P93504 and previous config saved to /var/cache/conftool/dbconfig/20260602-082026-fceratto.json [08:20:43] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:21:16] (03CR) 10Jcrespo: "I won't -1 this, but I suggest to keep the general parts with sections: [], I cannot guarantee this will not cause systemd alerts because " [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [08:22:33] (03PS4) 10Muehlenhoff: Add dbbackups profile for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296371 [08:25:26] (03CR) 10Jcrespo: [C:03+1] Add dbbackups profile for cumin2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [08:29:35] !log IDP, new configuration in preparation for webauthn [08:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:07] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:30:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T419635)', diff saved to https://phabricator.wikimedia.org/P93505 and previous config saved to /var/cache/conftool/dbconfig/20260602-083033-fceratto.json [08:30:38] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:30:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [08:33:20] (03PS1) 10Arnaudb: puppetserver: pull puppet via discovery 
record [puppet] - 10https://gerrit.wikimedia.org/r/1296495 (https://phabricator.wikimedia.org/T420184) [08:33:20] (03CR) 10Arnaudb: "pcc output visible here: https://puppet-compiler.wmflabs.org/output/1296495/6927/puppetserver1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1296495 (https://phabricator.wikimedia.org/T420184) (owner: 10Arnaudb) [08:33:45] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [08:37:15] !log Reset user email of Barras@votewiki to the one of Barras@SUL [08:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:23] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:39:32] jouncebot: nowandnext [08:39:32] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [08:39:32] In 1 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1000) [08:39:46] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations, 13Patch-For-Review: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11975411 (10ABran-WMF) >>! In T420184#11968357, @Dzahn wrote: > The stri... [08:41:23] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:43:26] (03PS1) 10Bartosz Wójtowicz: ml-services: Bump experimental ns memory quota to 256Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296500 [08:44:15] (03CR) 10Dpogorzelski: [C:03+2] ml-services: Bump experimental ns memory quota to 256Gi. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296500 (owner: 10Bartosz Wójtowicz) [08:46:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2045.codfw.wmnet [08:47:46] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:49:00] (03PS1) 10Slyngshede: Enable WebAuthN support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1296505 (https://phabricator.wikimedia.org/T372892) [08:50:23] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-services: Bump experimental ns memory quota to 256Gi. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296500 (owner: 10Bartosz Wójtowicz) [08:50:46] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:50:57] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:51:52] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:52:19] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:52:50] (03PS13) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [08:53:16] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:53:19] (03CR) 10Federico Ceratto: "I added comments in the code and removed an unnecessary sleep" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [08:54:24] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:54:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [08:55:07] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:56:32] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1055.eqiad.wmnet with OS trixie [08:56:48] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [08:56:57] (03CR) 10Jcrespo: [C:03+1] "Please let us know when you plan to decom 2002, I would like to test backups before migration thoroughly." [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [08:59:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2250.codfw.wmnet with reason: rack A3 maintenance [09:01:09] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1296507 (https://phabricator.wikimedia.org/T427892) [09:01:22] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-drmrs unexpected reboot - https://phabricator.wikimedia.org/T427600#11975473 (10cmooney) {F86180058} [09:01:23] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1296508 (https://phabricator.wikimedia.org/T427893) [09:04:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [09:04:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T426633)', diff saved to https://phabricator.wikimedia.org/P93506 and previous config saved to /var/cache/conftool/dbconfig/20260602-090432-fceratto.json [09:06:26] (03Abandoned) 10CWilliams: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1296508 (https://phabricator.wikimedia.org/T427893) (owner: 10Gerrit maintenance bot) [09:08:50] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!"
[alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [09:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:09:25] (03CR) 10Ayounsi: [C:03+2] Add RejectingBGPPrefixes alert [alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [09:09:29] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [09:09:49] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1258 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1296510 (https://phabricator.wikimedia.org/T427895) [09:09:54] (03PS1) 10Gerrit maintenance bot: wmnet: Update x3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1296511 (https://phabricator.wikimedia.org/T427895) [09:10:18] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:11:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T426633)', diff saved to https://phabricator.wikimedia.org/P93508 and previous config saved to /var/cache/conftool/dbconfig/20260602-091126-fceratto.json [09:11:32] (03Merged) 10jenkins-bot: Add RejectingBGPPrefixes alert [alerts] - 10https://gerrit.wikimedia.org/r/1295805 (https://phabricator.wikimedia.org/T423384) (owner: 10Ayounsi) [09:13:06] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, good thinking!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1295919 (https://phabricator.wikimedia.org/T419298) (owner: 10Ayounsi) [09:13:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:07] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1055.eqiad.wmnet with reason: host reimage [09:15:21] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1187: Pooling [09:15:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:21:00] (03CR) 10Muehlenhoff: [C:03+2] Add dbbackups profile for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1296371 (owner: 10Muehlenhoff) [09:22:39] (03CR) 10Btullis: [C:03+2] Configure nginx to log requests in ECS format to syslog [puppet] - 10https://gerrit.wikimedia.org/r/1287407 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [09:24:39] (03CR) 10Ayounsi: [C:03+2] Add InterfaceNoDescription alert [alerts] - 10https://gerrit.wikimedia.org/r/1295919 (https://phabricator.wikimedia.org/T419298) (owner: 10Ayounsi) [09:26:43] (03Merged) 10jenkins-bot: Add InterfaceNoDescription alert [alerts] - 10https://gerrit.wikimedia.org/r/1295919 (https://phabricator.wikimedia.org/T419298) (owner: 10Ayounsi) [09:28:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:30:21] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1055.eqiad.wmnet with OS trixie 
[09:30:25] FIRING: SystemdUnitFailed: prometheus-node-textfile-export_service_type.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:55] !log temporarily remove ganeti2045 from the codfw cluster T427357 [09:32:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Default most APIs to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1293699 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [09:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:00] T427357: codfw: rack A4 maintenance - https://phabricator.wikimedia.org/T427357 [09:33:14] (03PS15) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [09:33:29] (03CR) 10Kosta Harlan: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [09:33:56] (03PS1) 10Blake: mcrouter_wancache: swap mc1054 for mc1055 to enable decom [puppet] - 10https://gerrit.wikimedia.org/r/1296513 (https://phabricator.wikimedia.org/T426044) [09:34:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of rpki2003.codfw.wmnet to plain [09:34:23] !log Disabling puppet on A:cp-text for ATS rest-gateway cleanup - T422937 [09:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:26] T422937: Cleanup ATS configuration for API paths - https://phabricator.wikimedia.org/T422937 [09:34:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of rpki2003.codfw.wmnet to plain [09:35:01] (03PS1) 10Urbanecm: [Growth] Set wgGEMentorshipCleanupEnabled to false on 
all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) [09:35:05] PROBLEM - ganeti-noded running on ganeti2045 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:35:07] PROBLEM - ganeti-confd running on ganeti2045 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:35:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow2004.codfw.wmnet to plain [09:35:50] FIRING: ProbeDown: Service ganeti2045:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:37:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance [09:37:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93511 and previous config saved to /var/cache/conftool/dbconfig/20260602-093716-fceratto.json [09:37:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow2004.codfw.wmnet to plain [09:37:39] !log Running puppet on cp6010 and cp6011 - T422937 [09:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:00] (03PS1) 10STran: Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) [09:40:17] (03PS1) 10STran: Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296517 
(https://phabricator.wikimedia.org/T427788) [09:40:56] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:40] (03PS1) 10Urbanecm: growthexperiments.pp: Run cleanMentorList every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) [09:43:43] (03CR) 10Mszwarc: [C:03+1] Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [09:43:47] (03CR) 10Mszwarc: [C:03+1] Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296517 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [09:44:04] (03PS1) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [09:45:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1187: Pooling [09:45:46] (03CR) 10CI reject: [V:04-1] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [09:45:56] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:46:13] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cumin2003.codfw.wmnet with reason: in setup [09:46:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [09:46:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReportIncident] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296517 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [09:49:00] (03CR) 10CI reject: [V:04-1] Add a reply-to to Direct Reporting emails [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [09:50:50] RESOLVED: ProbeDown: Service ganeti2045:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:11] !log Enabling puppet on A:cp-text for ATS rest-gateway cleanup - T422937 [09:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:15] T422937: Cleanup ATS configuration for API paths - https://phabricator.wikimedia.org/T422937 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1000) [10:00:12] (03PS2) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on 
saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [10:03:21] (03CR) 10CI reject: [V:04-1] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:03:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1296505 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:03:59] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295956 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:04:41] (03PS2) 10Jcrespo: dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) [10:05:16] (03PS3) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [10:05:19] (03PS3) 10Jcrespo: dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) [10:05:37] (03CR) 10Mszwarc: [C:03+1] "recheck" [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296516 (https://phabricator.wikimedia.org/T427788) (owner: 10STran) [10:06:26] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/eventstreams-internal: apply [10:06:53] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/eventstreams-internal: apply [10:06:58] (03CR) 10CI reject: [V:04-1] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:08:03] !log marostegui@cumin1003 START - Cookbook 
sre.mysql.major-upgrade [10:08:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2056: Upgrading es2056.codfw.wmnet [10:08:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2056: Upgrading es2056.codfw.wmnet [10:09:03] (03PS4) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [10:09:09] (03PS4) 10Jcrespo: dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) [10:09:32] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2056.codfw.wmnet with OS trixie [10:10:44] (03CR) 10CI reject: [V:04-1] netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:12:13] (03PS1) 10Muehlenhoff: Inline profile::mail::smarthost into profile::mail::smarthost::wmcs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1296528 [10:12:46] (03PS5) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) [10:12:50] (03PS1) 10Ilias Sarantopoulos: alertmanager: Add Slack alerts to public slack channel for ML team [puppet] - 10https://gerrit.wikimedia.org/r/1296529 [10:13:53] (03CR) 10CI reject: [V:04-1] Inline profile::mail::smarthost into profile::mail::smarthost::wmcs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [10:20:22] (03CR) 10Cathal Mooney: "I think this makes sense but admittedly it's tricky to get thresholds like this right so happy to discuss." 
[alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:21:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93515 and previous config saved to /var/cache/conftool/dbconfig/20260602-102139-fceratto.json [10:24:58] (03PS2) 10Muehlenhoff: Inline profile::mail::smarthost into profile::mail::smarthost::wmcs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1296528 [10:25:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2056.codfw.wmnet with reason: host reimage [10:27:14] !log Disabling puppet on A:cp-text for ATS rest-gateway cleanup - T422937 [10:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:17] T422937: Cleanup ATS configuration for API paths - https://phabricator.wikimedia.org/T422937 [10:28:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2056.codfw.wmnet with reason: host reimage [10:31:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P93516 and previous config saved to /var/cache/conftool/dbconfig/20260602-103146-fceratto.json [10:31:54] (03CR) 10Ayounsi: "Some phrasing comments inline, overall lgtm." 
[alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:33:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [10:33:55] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Route /media/math directly to restbase [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [10:34:04] (03PS2) 10Clément Goubert: trafficserver: Route /media/math directly to restbase [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) [10:36:07] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Route /media/math directly to restbase [puppet] - 10https://gerrit.wikimedia.org/r/1293703 (https://phabricator.wikimedia.org/T422937) (owner: 10Clément Goubert) [10:36:28] (03PS1) 10Dreamy Jazz: hCaptcha: Deduplicate edit API detection code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296532 (https://phabricator.wikimedia.org/T427887) [10:36:31] (03PS1) 10Dreamy Jazz: hCaptcha: Disable hCaptcha for DiscussionTools for the apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296533 (https://phabricator.wikimedia.org/T427887) [10:36:47] (03PS1) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [10:37:09] (03PS1) 10Btullis: dumps: web: Fix nginx ECS access log config so nginx can start [puppet] - 10https://gerrit.wikimedia.org/r/1296535 (https://phabricator.wikimedia.org/T291645) [10:39:27] (03CR) 10Cathal Mooney: netops: set CR packet drop alert to paging and up timer on saturation (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1296520 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [10:41:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', 
diff saved to https://phabricator.wikimedia.org/P93517 and previous config saved to /var/cache/conftool/dbconfig/20260602-104154-fceratto.json [10:42:12] !log Enabling puppet on A:cp-text for ATS rest-gateway cleanup - T422937 [10:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:15] T422937: Cleanup ATS configuration for API paths - https://phabricator.wikimedia.org/T422937 [10:42:28] !log installing busybox security updates [10:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2056.codfw.wmnet with OS trixie [10:45:10] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter_wancache: swap mc1054 for mc1055 to enable decom [puppet] - 10https://gerrit.wikimedia.org/r/1296513 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [10:45:18] (03CR) 10Btullis: "Note that the use of 'geo' is definitely a hacky workaround to obtain a literal dollar sign from an nginx config, but it is the closest th" [puppet] - 10https://gerrit.wikimedia.org/r/1296535 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:45:37] (03CR) 10Blake: [C:03+2] mcrouter_wancache: swap mc1054 for mc1055 to enable decom [puppet] - 10https://gerrit.wikimedia.org/r/1296513 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [10:45:52] (03PS1) 10Majavah: confd: Replace deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1296536 [10:45:52] (03PS1) 10Majavah: confd: Add condition to prevent starting without configs [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) [10:45:54] (03PS6) 10Btullis: Configure rsyslog to forward 'dumps_http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) [10:45:54] (03PS6) 10Btullis: logstash: Consume the ECS dumps webrequest stream from Kafka [puppet] - 
10https://gerrit.wikimedia.org/r/1295917 (https://phabricator.wikimedia.org/T291645) [10:48:28] marostegui@cumin1003 major-upgrade (PID 4060565) is awaiting input [10:49:21] (03PS15) 10Daniel Kinzler: rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) [10:50:16] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8628/co" [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [10:51:24] (03CR) 10Slyngshede: [V:03+2 C:03+2] Enable WebAuthN support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1296505 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:52:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T426633)', diff saved to https://phabricator.wikimedia.org/P93518 and previous config saved to /var/cache/conftool/dbconfig/20260602-105202-fceratto.json [10:52:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:52:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:52:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T426633)', diff saved to https://phabricator.wikimedia.org/P93519 and previous config saved to /var/cache/conftool/dbconfig/20260602-105239-fceratto.json [10:55:24] (03PS2) 10Btullis: Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) [10:55:36] (03PS3) 10Btullis: Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml 
[puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) [10:56:31] (03PS1) 10Atsuko: opensearch-cluster: anonymous access for ttmsearch and toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296539 (https://phabricator.wikimedia.org/T424248) [10:57:22] (03CR) 10Michael Große: [C:03+1] [Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [10:58:12] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296535 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:58:18] (03CR) 10Atsuko: [C:03+1] dumps: web: Fix nginx ECS access log config so nginx can start [puppet] - 10https://gerrit.wikimedia.org/r/1296535 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:59:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T426633)', diff saved to https://phabricator.wikimedia.org/P93520 and previous config saved to /var/cache/conftool/dbconfig/20260602-105939-fceratto.json [11:00:29] !incidents [11:00:29] 8038 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [11:00:43] (03CR) 10Btullis: [C:03+2] dumps: web: Fix nginx ECS access log config so nginx can start [puppet] - 10https://gerrit.wikimedia.org/r/1296535 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [11:01:45] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [11:01:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2056: repool after upgrade [11:02:41] (03CR) 10Btullis: [C:03+1] opensearch-cluster: anonymous access for ttmsearch and toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296539 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [11:03:02] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s8 T427892 [11:03:06] T427892: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T427892 [11:03:33] (03CR) 10Atsuko: Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) (owner: 10Btullis) [11:04:21] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set db2165 with weight 0 T427892', diff saved to https://phabricator.wikimedia.org/P93522 and previous config saved to /var/cache/conftool/dbconfig/20260602-110420-cwilliams.json [11:05:10] (03PS1) 10Dreamy Jazz: hCaptcha: Don't show AbuseFilter CAPTCHA for unsupported APIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) [11:05:51] (03CR) 10CI reject: [V:04-1] hCaptcha: Don't show AbuseFilter CAPTCHA for unsupported APIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) (owner: 10Dreamy Jazz) [11:06:36] (03PS4) 10Btullis: Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) [11:06:44] (03CR) 10Atsuko: [C:03+2] opensearch-cluster: anonymous access for ttmsearch and toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296539 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [11:07:11] i'll be reimaging memcached servers to Trixie, ~2 at a time or so, no impact expected; i'll be keeping an eye out for errors (feel free to holler at me if i don't notice) [11:07:21] (03CR) 10Btullis: Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) (owner: 10Btullis) [11:08:15] (03CR) 10Michael Große: [C:03+1] growthexperiments.pp: Run cleanMentorList every 3 days [puppet] - 
10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [11:08:56] (03CR) 10CWilliams: [C:03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1296507 (https://phabricator.wikimedia.org/T427892) (owner: 10Gerrit maintenance bot) [11:08:56] (03Merged) 10jenkins-bot: opensearch-cluster: anonymous access for ttmsearch and toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296539 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [11:09:02] (03CR) 10Dreamy Jazz: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) (owner: 10Dreamy Jazz) [11:09:05] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1056.eqiad.wmnet with OS trixie [11:09:24] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1057.eqiad.wmnet with OS trixie [11:09:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P93523 and previous config saved to /var/cache/conftool/dbconfig/20260602-110947-fceratto.json [11:10:49] !log Starting s8 codfw failover from db2161 to db2165 - T427892 [11:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:53] T427892: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T427892 [11:12:01] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Promote db2165 to s8 primary T427892', diff saved to https://phabricator.wikimedia.org/P93524 and previous config saved to /var/cache/conftool/dbconfig/20260602-111200-cwilliams.json [11:12:53] (03CR) 10Atsuko: [C:03+1] Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1292045 (https://phabricator.wikimedia.org/T422038) (owner: 10Btullis) [11:14:31] (03CR) 10Majavah: [C:03+1] designate: remove leftover mcrouter code (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: 10Andrew Bogott) [11:15:12] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Depool db2161 T427892', diff saved to https://phabricator.wikimedia.org/P93525 and previous config saved to /var/cache/conftool/dbconfig/20260602-111511-cwilliams.json [11:15:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:16:59] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [11:17:37] (03CR) 10Marostegui: [C:03+1] profile::firewall: Allow to provide more fine-grained access from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [11:17:59] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [11:18:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:19:09] (03PS3) 10Muehlenhoff: Inline profile::mail::smarthost into profile::mail::smarthost::wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1296528 [11:19:27] (03PS4) 10Muehlenhoff: Inline profile::mail::smarthost into profile::mail::smarthost::wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1296528 [11:19:34] (03CR) 10Btullis: [C:03+2] Add the new dse-k8s-wdqs nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1292045 
(https://phabricator.wikimedia.org/T422038) (owner: 10Btullis) [11:19:34] (03PS5) 10Muehlenhoff: Inline profile::mail::smarthost into profile::mail::smarthost::wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1296528 [11:19:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P93527 and previous config saved to /var/cache/conftool/dbconfig/20260602-111954-fceratto.json [11:21:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:21:26] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [11:21:33] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2161: Upgrading db2161.codfw.wmnet [11:21:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2161: Upgrading db2161.codfw.wmnet [11:22:12] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [11:22:42] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for badlogin on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) [11:23:21] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2161.codfw.wmnet with OS trixie [11:23:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [11:23:45] (03PS1) 10JMeybohm: partman/reuse-raid10-6dev.cfg: Use linux-swap as fs identifier [puppet] - 10https://gerrit.wikimedia.org/r/1296553 (https://phabricator.wikimedia.org/T427088) [11:25:18] (03CR) 
10Dreamy Jazz: [C:04-1] "Until group0 is wmf.5, this is blocked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [11:26:05] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1161: Repooling [11:26:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1161: Repooling [11:29:17] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1056.eqiad.wmnet with reason: host reimage [11:30:09] (03PS1) 10Jelto: miscweb: fix sleep command in data-sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296555 (https://phabricator.wikimedia.org/T414405) [11:30:10] (03CR) 10Btullis: "This has now been validated as per: https://phabricator.wikimedia.org/T425087#11975964" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [11:30:12] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [11:30:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T426633)', diff saved to https://phabricator.wikimedia.org/P93529 and previous config saved to /var/cache/conftool/dbconfig/20260602-113019-fceratto.json [11:30:56] (03CR) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [11:31:55] PROBLEM - Memcached on mc1057 is CRITICAL: connect to address 10.64.0.197 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [11:32:46] (03CR) 10Muehlenhoff: "(The PCC diff across the moved file is a bit confusing to read)" [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [11:33:03] host is being reimaged ^, something didn't go well with downtiming [11:33:06] !log blake@cumin1003 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1057.eqiad.wmnet with reason: host reimage [11:33:36] (03PS2) 10Dreamy Jazz: hCaptcha: Don't show AbuseFilter CAPTCHA for wbsetclaim API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) [11:37:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T426633)', diff saved to https://phabricator.wikimedia.org/P93531 and previous config saved to /var/cache/conftool/dbconfig/20260602-113705-fceratto.json [11:38:32] (03CR) 10Atsuko: [C:03+1] "Reviewed 5->6" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [11:39:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11975998 (10MoritzMuehlenhoff) [11:39:46] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [11:40:38] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2161.codfw.wmnet with reason: host reimage [11:40:55] RECOVERY - Memcached on mc1057 is OK: TCP OK - 0.000 second response time on 10.64.0.197 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [11:41:29] (03CR) 10JMeybohm: [C:03+1] Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [11:42:26] (03CR) 10Jelto: [C:03+2] miscweb: fix sleep command in data-sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296555 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [11:44:07] (03PS2) 10JMeybohm: partman/reuse-raid10-6dev.cfg: Use linux-swap as fs identifier [puppet] - 10https://gerrit.wikimedia.org/r/1296553 (https://phabricator.wikimedia.org/T427088) [11:44:47] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2161.codfw.wmnet with reason: host reimage [11:45:00] (03Merged) 10jenkins-bot: miscweb: fix sleep command in data-sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296555 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [11:45:09] (03CR) 10Effie Mouzeli: [C:03+2] ratelimite: update homepage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294314 (https://phabricator.wikimedia.org/T426951) (owner: 10Effie Mouzeli) [11:45:30] (03PS1) 10Kosta Harlan: hCaptcha: Remove apiUrl health check and APCu layer from health checker [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296558 (https://phabricator.wikimedia.org/T421464) [11:45:52] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1056.eqiad.wmnet with OS trixie [11:47:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P93532 and previous config saved to /var/cache/conftool/dbconfig/20260602-114713-fceratto.json [11:47:20] (03Merged) 10jenkins-bot: ratelimite: update homepage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294314 (https://phabricator.wikimedia.org/T426951) (owner: 10Effie Mouzeli) [11:47:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2056: repool after upgrade [11:47:38] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1058.eqiad.wmnet with OS trixie [11:47:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:48:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2049: Upgrading es2049.codfw.wmnet [11:48:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2049: Upgrading es2049.codfw.wmnet [11:49:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2049.codfw.wmnet with OS trixie [11:49:31] !log 
blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1057.eqiad.wmnet with OS trixie [11:50:11] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1059.eqiad.wmnet with OS trixie [11:51:17] (03CR) 10Kosta Harlan: [C:03+1] hCaptcha: Disable hCaptcha for DiscussionTools for the apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296533 (https://phabricator.wikimedia.org/T427887) (owner: 10Dreamy Jazz) [11:51:46] (03CR) 10Kosta Harlan: hCaptcha: Enable for badlogin on group0 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [11:51:53] (03PS2) 10Dreamy Jazz: hCaptcha: Enable for badlogin on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) [11:52:13] (03CR) 10Dreamy Jazz: [C:04-1] hCaptcha: Enable for badlogin on group0 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [11:53:05] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:53:13] (03CR) 10Muehlenhoff: partman/reuse-raid10-6dev.cfg: Use linux-swap as fs identifier (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296553 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [11:53:29] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [11:53:48] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [11:54:01] (03CR) 10Jcrespo: [C:03+1] dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [11:54:04] (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [11:55:00] !log 
jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [11:55:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:55:34] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:55:42] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [11:57:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P93535 and previous config saved to /var/cache/conftool/dbconfig/20260602-115721-fceratto.json [11:58:24] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1200) [12:00:21] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [12:01:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2161.codfw.wmnet with OS trixie [12:02:17] jouncebot: nowandnext [12:02:17] For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1200) [12:02:17] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1300) [12:02:26] Anyone using scap in this window? 
[12:02:30] (03PS1) 10Mpostoronca: wmf-config: Disable hCaptcha for action=mcrundo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) [12:02:42] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [12:03:42] (03CR) 10Muehlenhoff: site.pp: add rdb2013 and rdb2014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [12:04:39] (03CR) 10Muehlenhoff: [C:03+1] partman/reuse-raid10-6dev.cfg: Use linux-swap as fs identifier [puppet] - 10https://gerrit.wikimedia.org/r/1296553 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [12:04:41] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1058.eqiad.wmnet with reason: host reimage [12:04:49] (03CR) 10Dreamy Jazz: wmf-config: Disable hCaptcha for action=mcrundo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:05:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296532 (https://phabricator.wikimedia.org/T427887) (owner: 10Dreamy Jazz) [12:05:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2049.codfw.wmnet with reason: host reimage [12:05:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296533 (https://phabricator.wikimedia.org/T427887) (owner: 10Dreamy Jazz) [12:06:40] (03CR) 10Jcrespo: [C:03+2] dbbackups: Reenable read-only ES backups [puppet] - 10https://gerrit.wikimedia.org/r/1295925 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [12:06:45] (03Merged) 10jenkins-bot: hCaptcha: Deduplicate edit API detection code [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1296532 (https://phabricator.wikimedia.org/T427887) (owner: 10Dreamy Jazz) [12:06:54] (03Merged) 10jenkins-bot: hCaptcha: Disable hCaptcha for DiscussionTools for the apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296533 (https://phabricator.wikimedia.org/T427887) (owner: 10Dreamy Jazz) [12:07:09] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1296532|hCaptcha: Deduplicate edit API detection code (T427887)]], [[gerrit:1296533|hCaptcha: Disable hCaptcha for DiscussionTools for the apps (T427887)]] [12:07:11] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet [12:07:16] T427887: Cannot publish DiscussionTools reply on Android App - https://phabricator.wikimedia.org/T427887 [12:07:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T426633)', diff saved to https://phabricator.wikimedia.org/P93536 and previous config saved to /var/cache/conftool/dbconfig/20260602-120728-fceratto.json [12:07:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:07:51] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [12:07:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T426633)', diff saved to https://phabricator.wikimedia.org/P93537 and previous config saved to /var/cache/conftool/dbconfig/20260602-120755-fceratto.json [12:08:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976085 (10Jclark-ctr) # Dedicated dse-k8s workers for production WDQS in codfw - See #T425653 node /^dse-k8s-wdqs200[1-4]\.c... 
[12:08:18] (03CR) 10JMeybohm: [C:03+2] partman/reuse-raid10-6dev.cfg: Use linux-swap as fs identifier (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296553 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [12:09:00] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1296532|hCaptcha: Deduplicate edit API detection code (T427887)]], [[gerrit:1296533|hCaptcha: Disable hCaptcha for DiscussionTools for the apps (T427887)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:09:56] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Switch maintenance [12:09:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2161: Migration of db2161.codfw.wmnet completed [12:11:05] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lsw1-a3-codfw,lsw1-a3-codfw IPv6,lsw1-a3-codfw.mgmt with reason: Switch maintenance [12:11:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2049.codfw.wmnet with reason: host reimage [12:11:54] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [12:13:25] (03PS1) 10Reedy: Add a maintenance script to delete old files [extensions/timeline] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296560 [12:13:42] (03PS1) 10Reedy: Add a maintenance script to delete old files [extensions/timeline] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296561 [12:14:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T426633)', diff saved to https://phabricator.wikimedia.org/P93539 and previous config saved to /var/cache/conftool/dbconfig/20260602-121451-fceratto.json [12:16:11] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296532|hCaptcha: Deduplicate edit API detection code (T427887)]], 
[[gerrit:1296533|hCaptcha: Disable hCaptcha for DiscussionTools for the apps (T427887)]] (duration: 09m 02s) [12:16:18] T427887: Cannot publish DiscussionTools reply on Android App - https://phabricator.wikimedia.org/T427887 [12:17:34] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie [12:18:39] (03PS3) 10Slyngshede: P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) [12:19:06] (03CR) 10Slyngshede: "Documentation created: https://wikitech.wikimedia.org/wiki/X-Image-Generator" [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: 10Slyngshede) [12:20:06] (03PS1) 10Bartosz Wójtowicz: ml-services: Bump outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296562 [12:20:06] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1058.eqiad.wmnet with OS trixie [12:20:47] (03PS1) 10Bartosz Dziewoński: Remove workaround for stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) [12:20:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet [12:20:51] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1060.eqiad.wmnet with OS trixie [12:21:06] (03PS1) 10Btullis: Fix the hostnames for dse-k8s-wdqs100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1296564 (https://phabricator.wikimedia.org/T423314) [12:21:28] !log reboot lsw1-a3-codfw for software upgrade - T427301 [12:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:32] T427301: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 [12:21:46] Shutdown at Tue Jun 2 12:22:36 2026 
[12:22:35] (03CR) 10Btullis: [C:03+2] Fix the hostnames for dse-k8s-wdqs100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1296564 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [12:23:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:38] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1059.eqiad.wmnet with OS trixie [12:24:42] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:52] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:24:52] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:25:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P93542 and previous config saved to /var/cache/conftool/dbconfig/20260602-122459-fceratto.json [12:25:27] (03PS1) 10Bartosz Dziewoński: Clean up bot password configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 [12:25:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [12:25:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC 
afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [12:26:28] (03CR) 10Majavah: "there's a minor diff if you scroll to the file resource all the way down" [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [12:26:32] (03CR) 10Majavah: [C:04-1] Inline profile::mail::smarthost into profile::mail::smarthost::wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1296528 (owner: 10Muehlenhoff) [12:26:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a3-codfw (10.192.252.5) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:26:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/2 (Core: lsw1-a3-codfw:et-0/0/55 {#230403800027}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:27:19] (03CR) 10AikoChou: [C:03+1] ml-services: Bump outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296562 (owner: 10Bartosz Wójtowicz) [12:27:43] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Bump outlink-topic-model image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296562 (owner: 10Bartosz Wójtowicz) [12:28:14] jouncebot: nowandnext [12:28:14] For the next 0 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1200) [12:28:14] In 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1300) [12:28:41] (03PS1) 10Kosta Harlan: hCaptcha: Remove apiUrl health check and APCu layer from health checker [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296568 (https://phabricator.wikimedia.org/T421464) [12:28:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2049.codfw.wmnet with OS trixie [12:28:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:56] (03PS1) 10Slyngshede: Geo-maps: Update Meta mapping for June 2026 [dns] - 10https://gerrit.wikimedia.org/r/1296569 [12:28:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:29:29] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1061.eqiad.wmnet with OS trixie [12:29:46] (03Merged) 10jenkins-bot: ml-services: Bump outlink-topic-model image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296562 (owner: 10Bartosz Wójtowicz) [12:31:41] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [12:31:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2049: repool after upgrade [12:33:26] switch is back up [12:33:34] 11min downtime [12:33:40] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [12:33:51] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:35:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P93545 and previous config saved to /var/cache/conftool/dbconfig/20260602-123505-fceratto.json [12:35:35] (03PS7) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [12:35:44] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [12:35:44] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:35:54] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:35:54] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/2 (Core: lsw1-a3-codfw:et-0/0/55 {#230403800027}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:38:32] (03CR) 10Arnaudb: trafficserver: add a map for gitlab as a backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:38:43] (03PS1) 10Arnaudb: cache_text: add gitlab-https to realservers [puppet] - 10https://gerrit.wikimedia.org/r/1296572 (https://phabricator.wikimedia.org/T425441) [12:38:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:27] (03PS3) 10Anzx: cswiki: lift IP cap for workshop on 08-June-2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) [12:39:28] (03CR) 10Btullis: [C:03+2] Configure rsyslog to forward 'dumps_http' messages to Kafka [puppet] - 10https://gerrit.wikimedia.org/r/1287374 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [12:39:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [12:40:44] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:41:13] !log enable bgp graceful-shutdown in underlay on ssw1-a1-codfw T427301 [12:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:17] T427301: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 [12:41:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down 
between ssw1-a1-codfw and lsw1-a3-codfw (10.192.252.5) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:42:10] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [12:42:13] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [12:42:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [12:43:26] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (backup2013), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:43:29] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1060.eqiad.wmnet with OS trixie [12:43:32] (03PS1) 10Dpogorzelski: ml-serve: add node labels [puppet] - 10https://gerrit.wikimedia.org/r/1296574 [12:44:14] (03PS1) 10Majavah: P:syslog: centralserver: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1296577 [12:44:22] (03PS2) 10Dpogorzelski: ml-serve: add node labels [puppet] - 10https://gerrit.wikimedia.org/r/1296574 [12:45:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T426633)', diff saved to https://phabricator.wikimedia.org/P93547 and previous config saved to /var/cache/conftool/dbconfig/20260602-124512-fceratto.json [12:45:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:45:39] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11976265 (10MoritzMuehlenhoff) [12:45:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T426633)', diff saved to https://phabricator.wikimedia.org/P93548 and previous config saved to 
/var/cache/conftool/dbconfig/20260602-124541-fceratto.json [12:46:06] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8629/co" [puppet] - 10https://gerrit.wikimedia.org/r/1296577 (owner: 10Majavah) [12:46:37] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11976268 (10Jclark-ctr) @RKemper @wiki_willy I have gone through all decommissioned servers and do not have a matching Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz availabl... [12:46:40] (03PS21) 10Ayounsi: Create cookbook to depool all services in a given rack [cookbooks] - 10https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) [12:47:18] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1062.eqiad.wmnet with OS trixie [12:48:15] !log ayounsi@cumin1003 START - Cookbook sre.hosts.remove-downtime for lsw1-a3-codfw,lsw1-a3-codfw IPv6,lsw1-a3-codfw.mgmt [12:48:17] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lsw1-a3-codfw,lsw1-a3-codfw IPv6,lsw1-a3-codfw.mgmt [12:49:55] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mc1061.eqiad.wmnet with OS trixie [12:50:08] !log enable bgp graceful-shutdown in overlay on ssw1-a1-codfw T427301 [12:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:12] T427301: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 [12:51:59] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-serve: add node labels [puppet] - 10https://gerrit.wikimedia.org/r/1296574 (owner: 10Dpogorzelski) [12:52:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T426633)', diff saved to https://phabricator.wikimedia.org/P93550 and previous config saved to /var/cache/conftool/dbconfig/20260602-125223-fceratto.json 
[12:52:55] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [12:53:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1295023 (owner: 10Elukey) [12:54:55] !log shutdown sub-interfaces on cr1-codfw et-1/1/5 for row A/B vlans T427301 [12:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1296577 (owner: 10Majavah) [12:55:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2161: Migration of db2161.codfw.wmnet completed [12:55:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:55:59] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: add node labels [puppet] - 10https://gerrit.wikimedia.org/r/1296574 (owner: 10Dpogorzelski) [12:57:36] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:57:40] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:57:44] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:57:51] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:58:19] I have a patch to deploy this window, I'll be back at my computer in 10-15 mins :) [12:59:02] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:59:48] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1300). [13:00:05] MatmaRex, Msz2001, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] (03CR) 10Kamila Součková: [C:03+1] "Noting down to remember to revert it. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200) (owner: 10Scott French) [13:00:13] hi [13:00:31] o/ [13:00:31] my config changes are all no-ops / cleanups, i need someone to deploy for me :) [13:00:45] (03PS1) 10Dreamy Jazz: Use the globalblock-local-status right over globalblock-whitelist [extensions/GlobalBlocking] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296582 (https://phabricator.wikimedia.org/T277942) [13:01:10] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet - https://phabricator.wikimedia.org/T427852#11976331 (10Jclark-ctr) I did attempt the firmware updates, but after rebooting, the server became unresponsive and will not boot. At this point, I would need a compatibl... 
[13:01:49] (03PS10) 10Daniel Kinzler: Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) [13:02:25] RESOLVED: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P93553 and previous config saved to /var/cache/conftool/dbconfig/20260602-130230-fceratto.json [13:02:39] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Reimaging upstream servers [13:02:50] (03PS3) 10Anzx: Add kha to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296580 (https://phabricator.wikimedia.org/T427917) [13:03:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296580 (https://phabricator.wikimedia.org/T427917) (owner: 10Anzx) [13:03:12] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [13:03:12] !log increase OSPF cost on ssw1-a1-codfw et-0/0/2 towards lsw1-a3-codfw T427301 [13:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:17] T427301: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 [13:03:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1001.eqiad.wmnet with OS trixie [13:03:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata 
Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dse-k8s-wdqs1001... [13:03:38] (03CR) 10JMeybohm: [C:03+1] Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [13:03:43] I can deploy [13:03:44] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1022-1023].eqiad.wmnet with reason: Reimaging upstream servers [13:03:46] (03PS2) 10Mpostoronca: wmf-config: Skip CAPTCHA for action=mcrundo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) [13:04:19] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:04:21] (03CR) 10Mpostoronca: wmf-config: Skip CAPTCHA for action=mcrundo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [13:04:22] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [13:04:41] (03PS11) 10Daniel Kinzler: Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) [13:04:56] !log jayme@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2006.codfw.wmnet with OS trixie [13:06:04] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1063.eqiad.wmnet with OS trixie [13:06:24] (03CR) 10Dreamy Jazz: [C:03+1] Clean up bot password configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [13:06:35] Still looking over the changes [13:07:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host 
dse-k8s-wdqs1002.eqiad.wmnet with OS trixie [13:07:27] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1003.eqiad.wmnet with OS trixie [13:07:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dse-k8s-wdqs1002... [13:07:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dse-k8s-wdqs1003... [13:08:05] anzx: For your IP rate limit change is the start time as expected? [13:08:14] On the task I see 07:00 UTC+2 [13:08:30] But the throttle start time appears to be an hour earlier? [13:08:37] 06:00 +2:00 [13:09:00] yeah just to be safe , i set 1 hour early [13:09:13] (03CR) 10Dreamy Jazz: [C:03+1] "Going off the comment saying it's safe to remove, this looks fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [13:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:21] Sure, thanks [13:10:18] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions 
[13:10:34] (03CR) 10Kamila Součková: [C:03+2] .fixtures: remove erroneously committed file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295949 (owner: 10Kamila Součková) [13:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [13:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [13:11:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [13:11:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [13:11:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [13:11:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296582 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [13:11:42] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [13:11:46] Doing all but Msz2001's changes [13:11:54] (Msz2001 should be able to self deploy) [13:12:13] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [13:12:16] (03CR) 10Majavah: [V:03+1 C:03+2] P:syslog: centralserver: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1296577 (owner: 10Majavah) [13:12:21] !log atsuko@deploy1003 helmfile [codfw] START 
helmfile.d/services/eventstreams: apply [13:12:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P93554 and previous config saved to /var/cache/conftool/dbconfig/20260602-131238-fceratto.json [13:12:48] !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [13:13:03] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:13:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1167: Upgrading db1167.eqiad.wmnet [13:13:28] gate-and-submit is slow today, so may be a while before it gets started [13:13:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1167: Upgrading db1167.eqiad.wmnet [13:14:16] thanks Dreamy_Jazz [13:14:22] (03PS2) 10Kamila Součková: CI: Fix CI pass on template render fail [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) [13:14:51] (03CR) 10Ssingh: [C:03+1] Geo-maps: Update Meta mapping for June 2026 [dns] - 10https://gerrit.wikimedia.org/r/1296569 (owner: 10Slyngshede) [13:14:58] (03CR) 10Kamila Součková: CI: Fix CI pass on template render fail (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: 10Kamila Součková) [13:15:06] Actually seems zuul has stopped processing [13:15:31] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:15:34] castor being castor [13:15:57] globalblocking I guess is just backed up [13:16:19] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1167.eqiad.wmnet with OS trixie [13:16:28] (03PS1) 10Slyngshede: P:cumin:master remove liberica alias for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1296587 [13:16:35] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-selenium/57912/console is complete, but isn't being reflected as such in zuul [13:16:45] I'm back [13:17:00] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:17:06] Hi Msz2001, seems like zuul / CI is having a bad day [13:17:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2049: repool after upgrade [13:17:20] Still waiting for any gate-and-submit jobs to start [13:17:31] ouch... [13:17:52] An hour ago it worked [13:18:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to for - https://phabricator.wikimedia.org/T427553#11976402 (10Raine) [13:18:53] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [13:19:46] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1062.eqiad.wmnet with OS trixie [13:19:51] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie [13:20:19] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1064.eqiad.wmnet with OS trixie [13:22:33] (03PS2) 10Kamila Součková: admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) [13:22:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T426633)', diff saved to https://phabricator.wikimedia.org/P93557 and previous config saved to 
/var/cache/conftool/dbconfig/20260602-132246-fceratto.json [13:23:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [13:23:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93558 and previous config saved to /var/cache/conftool/dbconfig/20260602-132314-fceratto.json [13:23:49] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [13:24:40] !log increase OSPF cost on ssw1-a1-codfw et-0/0/4 towards lsw1-a5-codfw T427301 [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:43] T427301: codfw: rack A3 maintenance - https://phabricator.wikimedia.org/T427301 [13:25:05] Looks like zuul is completely stopped, I've posted in #wikimedia-releng and will see about getting it working again [13:25:29] (03CR) 10Slyngshede: [C:03+2] Geo-maps: Update Meta mapping for June 2026 [dns] - 10https://gerrit.wikimedia.org/r/1296569 (owner: 10Slyngshede) [13:25:38] !log slyngshede@dns1004 START - running authdns-update [13:26:16] PROBLEM - Host db2175 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:26:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976434 (10Jclark-ctr) {F86226390} These are Failing to image for preseed file [13:26:30] PROBLEM - Host backup2013 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:31] !ack [13:26:32] 8039 (ACKED) Host db2175 (paged) [13:26:38] PROBLEM - Host wikikube-worker2242 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:38] (03CR) 10Aqu: "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [13:26:38] PROBLEM - Host wikikube-worker2243 is 
DOWN: PING CRITICAL - Packet loss = 100% [13:26:38] PROBLEM - Host wikikube-worker2254 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:38] PROBLEM - Host wikikube-worker2255 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:42] (03CR) 10Dreamy Jazz: "While this is kinda hacky, it's not intended as a long term fix and will be removed once the interface supports it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [13:26:44] PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:44] PROBLEM - Host puppetserver2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:46] (03CR) 10Dreamy Jazz: [C:03+1] wmf-config: Skip CAPTCHA for action=mcrundo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [13:26:49] PROBLEM - Host es2050 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:26:50] PROBLEM - Host rdb2007 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:55] PROBLEM - Host db2154 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:26:55] PROBLEM - Host db2153 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:26:56] PROBLEM - Host db2157 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:26:57] uh? 
[13:27:04] wow [13:27:05] !ack [13:27:06] 8040 (ACKED) Host es2050 (paged) [13:27:06] 8041 (ACKED) Host db2154 (paged) [13:27:07] 8042 (ACKED) Host db2157 (paged) [13:27:09] PROBLEM - Host db2176 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:27:14] PROBLEM - Host wikikube-worker2017 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:14] PROBLEM - Host wikikube-worker2018 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2041 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2013 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2014 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2051 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2044 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2012 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:16] PROBLEM - Host wikikube-worker2074 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:17] PROBLEM - Host wikikube-worker2075 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:17] PROBLEM - Host wikikube-worker2092 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:18] PROBLEM - Host wikikube-worker2076 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:18] PROBLEM - Host wikikube-worker2091 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:19] PROBLEM - Host wikikube-worker2077 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:19] PROBLEM - Host wikikube-worker2078 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:20] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:21] !log slyngshede@dns1004 END - running authdns-update [13:27:24] is the rack going down? 
[13:27:35] RECOVERY - Host db2176 #page is UP: PING OK - Packet loss = 0%, RTA = 35.04 ms [13:27:35] that's us [13:27:41] RECOVERY - Host db2154 #page is UP: PING OK - Packet loss = 0%, RTA = 32.95 ms [13:27:41] RECOVERY - Host db2153 #page is UP: PING OK - Packet loss = 0%, RTA = 32.87 ms [13:27:42] RECOVERY - Host rdb2007 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms [13:27:42] A5? [13:27:42] RECOVERY - Host db2157 #page is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms [13:27:42] RECOVERY - Host wikikube-worker2017 is UP: PING OK - Packet loss = 0%, RTA = 32.93 ms [13:27:42] RECOVERY - Host wikikube-worker2018 is UP: PING OK - Packet loss = 0%, RTA = 32.90 ms [13:27:43] yeah, spine maintenance issue [13:27:46] RECOVERY - Host wikikube-worker2014 is UP: PING OK - Packet loss = 0%, RTA = 32.93 ms [13:27:46] RECOVERY - Host wikikube-worker2044 is UP: PING OK - Packet loss = 0%, RTA = 32.88 ms [13:27:46] RECOVERY - Host wikikube-worker2051 is UP: PING OK - Packet loss = 0%, RTA = 32.83 ms [13:27:46] RECOVERY - Host wikikube-worker2012 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [13:27:46] RECOVERY - Host wikikube-worker2013 is UP: PING OK - Packet loss = 0%, RTA = 32.88 ms [13:27:46] RECOVERY - Host wikikube-worker2041 is UP: PING OK - Packet loss = 0%, RTA = 32.92 ms [13:27:46] RECOVERY - Host wikikube-worker2075 is UP: PING OK - Packet loss = 0%, RTA = 32.93 ms [13:27:47] ohphew [13:27:47] RECOVERY - Host wikikube-worker2076 is UP: PING OK - Packet loss = 0%, RTA = 32.99 ms [13:27:47] RECOVERY - Host wikikube-worker2091 is UP: PING OK - Packet loss = 0%, RTA = 32.83 ms [13:27:48] RECOVERY - Host backup2013 is UP: PING OK - Packet loss = 0%, RTA = 32.92 ms [13:27:48] RECOVERY - Host wikikube-worker2092 is UP: PING OK - Packet loss = 0%, RTA = 32.85 ms [13:27:49] RECOVERY - Host wikikube-worker2078 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms [13:27:49] RECOVERY - Host wikikube-worker2074 is UP: PING OK - Packet loss = 0%, RTA = 32.84 ms [13:27:50] 
RECOVERY - Host db2175 #page is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms [13:27:50] RECOVERY - Host wikikube-worker2077 is UP: PING OK - Packet loss = 0%, RTA = 37.13 ms [13:27:51] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 33.01 ms [13:27:51] jclark@cumin1003 reimage (PID 16585) is awaiting input [13:27:52] RECOVERY - Host es2050 #page is UP: PING OK - Packet loss = 0%, RTA = 33.01 ms [13:27:55] yeah, change applied to one rack went fine, but not the other [13:27:55] oof [13:28:01] !ack [13:28:01] All incidents are already acked. [13:28:06] RECOVERY - Host wikikube-worker2255 is UP: PING OK - Packet loss = 0%, RTA = 32.84 ms [13:28:06] RECOVERY - Host wikikube-worker2243 is UP: PING OK - Packet loss = 0%, RTA = 32.83 ms [13:28:06] RECOVERY - Host wikikube-worker2254 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [13:28:06] RECOVERY - Host wikikube-worker2242 is UP: PING OK - Packet loss = 0%, RTA = 32.86 ms [13:28:11] free cardio in the morning, sitting. 
nice :D [13:28:14] RECOVERY - Host puppetserver2002 is UP: PING OK - Packet loss = 0%, RTA = 32.94 ms [13:28:14] RECOVERY - Host thanos-be2006 is UP: PING OK - Packet loss = 0%, RTA = 32.88 ms [13:28:58] for what it's worth it's not a full failure, but most likely the monitoring host to that rack lost connectivity [13:29:13] (03CR) 10Ladsgroup: profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:29:19] I'm looking at the DB metrics [13:29:20] some other pings to/from that rack were still fine, we're investigating [13:29:41] jclark@cumin1003 reimage (PID 16602) is awaiting input [13:30:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93559 and previous config saved to /var/cache/conftool/dbconfig/20260602-132959-fceratto.json [13:31:13] indeed on the DB side they don't seem to have lost connectivity for a significant amount of time [13:31:24] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [13:31:45] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:32:45] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [13:33:09] I got a backup timeout at :27 [13:33:14] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:33:18] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:34:04] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:34:10] 
!log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub: apply [13:35:13] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs1003.eqiad.wmnet with OS trixie [13:35:18] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs1002.eqiad.wmnet with OS trixie [13:35:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-wdqs1003.eqi... [13:35:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11976479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-wdqs1002.eqi... [13:35:57] federico3: all good for the DBs? [13:36:54] is it related to the maintenance or unrelated? [13:37:27] yes, they don't show drops in traffic [13:37:40] jynus: related [13:37:42] it's related to the maintenance yes, occurred after we de-preffed the link from ssw1-a1-codfw to lsw1-a5-codfw [13:37:58] though we did not expect that to interrupt things [13:38:07] yeah, np [13:38:12] the change was rolled back after which we got the recoveries [13:38:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11976487 (10Jclark-ctr) @colewhite can this be swapped at any time would you be able to rebuild after swapping? [13:38:14] was the work itself finished? 
[13:38:15] sorry folks <3 [13:38:18] ah, I got my answer [13:38:21] yes [13:38:27] we have to re-think this [13:38:37] I just want to know whether to wait a bit before retrying the long running backups [13:38:39] * urbanecm would like to do a (no-op) config change, waiting for things to settle down [13:38:43] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage [13:39:15] I am not affected by the interruption a lot service-wise, just for retries on ongoing maintenance [13:39:29] so waiting for a green light to retry [13:40:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P93560 and previous config saved to /var/cache/conftool/dbconfig/20260602-134007-fceratto.json [13:40:35] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [13:40:37] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1063.eqiad.wmnet with OS trixie [13:42:52] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [13:43:48] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS trixie [13:44:50] MatmaRex: anzx: I need to go shortly, so as CI is still blocked I won't be able to do your backports [13:44:54] 06SRE, 10SRE-Access-Requests: Requesting access to Cassandra staging for akhatun - https://phabricator.wikimedia.org/T427701#11976504 (10Raine) [13:44:56] (03CR) 10Muehlenhoff: profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:45:06] (03PS4) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 
(https://phabricator.wikimedia.org/T418924) [13:45:14] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 30 Jun 2026 01:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:45:25] (03PS1) 10JMeybohm: partman/reuse-raid10-6dev.cfg: Apply workaround to swap handling affecting trixie installations [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) [13:45:35] (03CR) 10CI reject: [V:04-1] site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:45:40] Maybe Msz2001 can you handle these changes? [13:45:45] Dreamy_Jazz: no problem [13:45:46] (03CR) 10CI reject: [V:04-1] partman/reuse-raid10-6dev.cfg: Apply workaround to swap handling affecting trixie installations [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [13:45:48] (03CR) 10Effie Mouzeli: "I removed the preseed addition, by public demand" [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:45:51] (03CR) 10Ladsgroup: [C:03+1] profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:45:52] I can [13:45:57] i can reschedule if we don't have time. i didn't have anything important [13:46:09] 06SRE, 10SRE-Access-Requests: Requesting access to Cassandra staging for akhatun - https://phabricator.wikimedia.org/T427701#11976521 (10Raine) @Ahoelzl can you please approve? Thanks! 
[13:46:13] Thanks, see you all around o/ [13:46:19] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [13:46:37] (03CR) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:46:45] (03PS5) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) [13:47:06] (03CR) 10CI reject: [V:04-1] site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:47:18] !log revert all config to normal on cr1-codfw and ssw1-a1-codfw [13:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:50:08] 06SRE, 10SRE-Access-Requests: Requesting access to Cassandra staging for akhatun - https://phabricator.wikimedia.org/T427701#11976541 (10Raine) @KOfori can you please approve access as group approver? Thank you! 
[13:50:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P93561 and previous config saved to /var/cache/conftool/dbconfig/20260602-135015-fceratto.json [13:50:52] (03CR) 10Ladsgroup: [C:03+1] profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:51:06] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:51:48] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:51:51] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:52:39] (03PS6) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) [13:52:58] (03CR) 10CI reject: [V:04-1] site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:54:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11976559 (10colewhite) >>! In T427748#11976487, @Jclark-ctr wrote: > @colewhite can this be swapped at any time would you be able to rebuild after swapping? Yes, I can do the... 
[13:54:47] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [13:54:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:55:04] jclark@cumin1003 reimage (PID 13706) is awaiting input [13:55:11] (03CR) 10Muehlenhoff: profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:55:38] (03CR) 10Ladsgroup: [C:03+1] profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:55:54] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1167.eqiad.wmnet with OS trixie [13:56:27] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [13:56:38] (03CR) 10Ladsgroup: [C:03+1] profile::firewall: Allow to provide more fine-grained access from monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1296251 (owner: 10Muehlenhoff) [13:56:48] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:59:21] (03CR) 10Aqu: Add commonswiki globalimagelinks monthly sqoop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [14:00:00] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1064.eqiad.wmnet with OS trixie [14:00:05] Deploy window Test Kitchen UI Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1400) [14:00:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93562 and previous config saved to /var/cache/conftool/dbconfig/20260602-140022-fceratto.json [14:00:24] !log ayounsi@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet [14:00:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet [14:00:45] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [14:01:11] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1066.eqiad.wmnet with OS trixie [14:01:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:01:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:01:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93563 and previous config saved to /var/cache/conftool/dbconfig/20260602-140140-fceratto.json [14:01:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:02:33] federico3: you can repool servers for https://phabricator.wikimedia.org/T427301 [14:02:40] thanks [14:03:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 
10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11976599 (10ayounsi) [14:04:53] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [14:05:26] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [14:06:48] (03PS1) 10Jcrespo: dbbackups: Testing x1 backups on new cumin2003 trixie host [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) [14:06:48] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [14:07:08] (03CR) 10CI reject: [V:04-1] dbbackups: Testing x1 backups on new cumin2003 trixie host [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [14:07:23] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [14:08:42] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments userOptions.php --delete growthexperiments-homepage-variant # T417621 [14:08:45] T417621: Remove 'growthexperiments-homepage-variant' user property from all wikis where it's present - https://phabricator.wikimedia.org/T417621 [14:08:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [14:09:04] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1048.eqiad.wmnet [14:09:05] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments userOptions.php --delete --nowarn growthexperiments-homepage-variant # T417621 [14:09:59] (03CR) 10JMeybohm: [V:03+2 C:03+2] partman/reuse-raid10-6dev.cfg: Apply workaround to swap handling affecting trixie 
installations [puppet] - 10https://gerrit.wikimedia.org/r/1296597 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [14:10:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93564 and previous config saved to /var/cache/conftool/dbconfig/20260602-141019-fceratto.json [14:11:08] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: re-rack mc2055 (before Jun 9th) - https://phabricator.wikimedia.org/T427373#11976649 (10Jhancock.wm) @jijiki i'm ready whenever you are to do the move. should only take about 20-30 minutes. I do have a meeting this morning bu... [14:13:55] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [14:14:28] MatmaRex, anzx: I'll reschedule my patches from today's window for tomorrow UTC morning. I can deploy yours as well at that time if you're okay with that (they seem trivial enough that they need no verification or that I can verify them myself) [14:14:31] jayme@cumin2002 reimage (PID 3580405) is awaiting input [14:14:43] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [14:14:51] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [14:15:08] Msz2001: sure, that's cool with me [14:15:11] Msz2001: ok, thanks [14:15:28] if anything turns out to not be trivial, i'll reschedule it :) [14:15:36] thanks [14:15:48] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS trixie [14:15:56] yw [14:16:46] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1065.eqiad.wmnet with OS trixie [14:17:03] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1167.eqiad.wmnet [14:17:04] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.hosts.remove-downtime (exit_code=0) for db1167.eqiad.wmnet [14:17:20] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie [14:17:22] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS trixie [14:18:36] (03CR) 10Mforns: Add filerevision to the mediawiki not-history sqoop (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [14:19:30] (03CR) 10Mforns: "I think if we use sqoopable_dblist in the other change, then we don't need this change at all, no?" [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [14:20:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P93566 and previous config saved to /var/cache/conftool/dbconfig/20260602-142027-fceratto.json [14:20:29] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [14:20:41] jiji@cumin1003 decommission (PID 74967) is awaiting input [14:21:18] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1167: Repooling after Icing wait-for-green timeout [14:23:32] (03CR) 10Lucas Werkmeister (WMDE): "(removing CR+2 so this doesn’t get merged accidentally without being deployed; AFAICT from a glance at the IRC backscroll, Zuul / gate-and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [14:23:41] (03CR) 10CI reject: [V:04-1] Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [14:23:42] Dreamy_Jazz: ^ fyi, I hope that’s okay [14:23:51] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:25:00] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1048.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [14:26:33] (03CR) 10Mszwarc: [C:03+1] "Removed CR+2 as it didn't get deployed due to CI problems. Let's not have dangling +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [14:26:42] (03CR) 10CI reject: [V:04-1] Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [14:26:46] (03CR) 10Mszwarc: [C:03+1] "Removed CR+2 as it didn't get deployed due to CI problems. Let's not have dangling +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [14:26:55] (03CR) 10CI reject: [V:04-1] Remove workaround for stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [14:26:59] (03CR) 10Mszwarc: [C:03+1] "Removed CR+2 as it didn't get deployed due to CI problems. 
Let's not have dangling +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [14:27:02] (03PS1) 10Btullis: dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) [14:27:07] (03CR) 10CI reject: [V:04-1] Clean up bot password configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [14:27:21] (03CR) 10CI reject: [V:04-1] dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:27:50] (03CR) 10Mszwarc: [C:03+1] "Removed CR+2 as it didn't get deployed due to CI problems. Let's not have dangling +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [14:27:58] (03CR) 10CI reject: [V:04-1] cswiki: lift IP cap for workshop on 08-June-2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [14:28:05] jiji@cumin1003 decommission (PID 74967) is awaiting input [14:28:24] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1430) [14:30:19] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage [14:30:27] (03PS2) 10Btullis: dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) [14:30:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P93569 and 
previous config saved to /var/cache/conftool/dbconfig/20260602-143035-fceratto.json [14:30:46] (03CR) 10CI reject: [V:04-1] dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:31:53] (03PS3) 10Btullis: dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) [14:32:17] (03CR) 10CI reject: [V:04-1] dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:32:18] (03CR) 10Ssingh: [C:03+1] P:cumin:master remove liberica alias for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1296587 (owner: 10Slyngshede) [14:32:44] (03CR) 10CI reject: [V:04-1] P:cumin:master remove liberica alias for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1296587 (owner: 10Slyngshede) [14:33:57] (03PS4) 10Btullis: dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) [14:34:16] (03CR) 10CI reject: [V:04-1] dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:34:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:51] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage [14:35:37] PROBLEM - PyBal backends health check on lvs1020 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:35:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:36:49] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1066.eqiad.wmnet with OS trixie [14:37:24] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1068.eqiad.wmnet with OS trixie [14:37:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1048.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [14:37:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1048.eqiad.wmnet [14:37:37] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:37:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:37:58] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:38:38] (03PS2) 10Urbanecm: [Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) [14:38:42] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [14:38:45] !log atsuko@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [14:38:48] (03CR) 10CI reject: [V:04-1] [Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:38:49] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:38:57] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:39:31] (03CR) 10Ssingh: [C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296587 (owner: 10Slyngshede) [14:40:40] jiji@cumin1003 decommission (PID 99366) is awaiting input [14:40:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T426633)', diff saved to https://phabricator.wikimedia.org/P93571 and previous config saved to /var/cache/conftool/dbconfig/20260602-144043-fceratto.json [14:40:49] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [14:41:02] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2158: Repooling [14:41:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:41:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93573 and previous config saved to /var/cache/conftool/dbconfig/20260602-144110-fceratto.json [14:41:20] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool pc2021: Repooling [14:41:20] !log fceratto@cumin1003 START - Cookbook sre.mysql.parsercache [14:41:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:41:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2021: Repooling 
[14:41:45] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet [14:41:45] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet [14:41:55] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2250.codfw.wmnet [14:41:55] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2250.codfw.wmnet [14:42:04] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for pc2021.codfw.wmnet [14:42:04] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for pc2021.codfw.wmnet [14:42:38] jouncebot nowandnext [14:42:38] For the next 0 hour(s) and 17 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1430) [14:42:38] In 0 hour(s) and 17 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1500) [14:42:49] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:42:49] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:43:24] (03CR) 10Lucas Werkmeister (WMDE): "I think we should split this up… first add the new `'msg'` keys, deploy that, make the Wikibase changes depend on that and merge those, th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [14:43:27] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (backup2013), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:44:30] (03PS1) 10TrainBranchBot: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296606 (https://phabricator.wikimedia.org/T423914) [14:44:33] (03CR) 
10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296606 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [14:44:44] (03CR) 10CI reject: [V:04-1] testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296606 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [14:44:50] (03CR) 10EMcFarland: [C:03+1] "Looks good, but even if I could +2 this, I'd want someone with more Puppet experience to do the final +2." [puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:45:30] (03Abandoned) 10Ahmon Dancy: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296606 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [14:45:44] (03CR) 10EMcFarland: [C:03+1] "Looking good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:48:26] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [14:49:33] (03PS1) 10Btullis: kafka event platform logs - Strip the stray $!msg field [puppet] - 10https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) [14:49:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93575 and previous config saved to /var/cache/conftool/dbconfig/20260602-144935-fceratto.json [14:50:17] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [14:50:23] 06SRE, 06Commons, 06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11976839 (10Ademola) Implementation is ready and tested. 
For batches under 200 files, maxSimultaneousReq is now set to 4; larger batches remain at 2. Source cod... [14:50:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [14:51:04] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1067.eqiad.wmnet with OS trixie [14:51:12] (03CR) 10Jcrespo: [C:03+2] dbbackups: Testing x1 backups on new cumin2003 trixie host [puppet] - 10https://gerrit.wikimedia.org/r/1296602 (https://phabricator.wikimedia.org/T427897) (owner: 10Jcrespo) [14:51:35] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:51:41] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [14:51:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [14:52:01] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1049.eqiad.wmnet [14:52:22] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [14:52:28] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver: apply [14:53:44] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1296587 (owner: 10Slyngshede) [14:54:31] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [14:54:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [14:56:22] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11976878 (10MoritzMuehlenhoff) [14:57:52] (03CR) 10Slyngshede: [C:03+2] 
P:cumin:master remove liberica alias for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1296587 (owner: 10Slyngshede) [14:58:36] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [14:59:06] (03CR) 10Effie Mouzeli: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [14:59:09] (03CR) 10Hashar: "recheck CI had some issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [14:59:13] (03CR) 10Hashar: "recheck CI had some issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [14:59:16] (03CR) 10Hashar: "recheck CI had some issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [14:59:20] (03CR) 10Hashar: "recheck CI had some issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [14:59:25] (03CR) 10CI reject: [V:04-1] admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [14:59:32] jouncebot: nowandnext [14:59:32] For the next 0 hour(s) and 0 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1430) [14:59:32] In 0 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1500) [14:59:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P93578 and previous config saved to /var/cache/conftool/dbconfig/20260602-145943-fceratto.json [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: May I have your attention 
please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1500) [15:00:21] (03CR) 10Urbanecm: [C:03+2] [Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [15:00:44] (Spiderpig isn't working for me) [15:01:15] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [15:01:27] Dreamy_Jazz: tbh i switched back to deploy once i got it to the "it broke when i opened my job" state [15:01:59] (03Merged) 10jenkins-bot: [Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296514 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [15:02:01] (03PS1) 10Atsuko: services_proxy: switch to prod opensearch-on-k8s services [puppet] - 10https://gerrit.wikimedia.org/r/1296608 (https://phabricator.wikimedia.org/T424248) [15:02:05] I assume you are deploying now? 
[15:02:20] If so I'll go in the queue behind you [15:02:20] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1069.eqiad.wmnet with OS trixie [15:02:43] Dreamy_Jazz: yep, it should be quick [15:02:46] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1296514|[Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis (T427386)]] [15:02:53] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [15:04:22] (03CR) 10Blake: [C:03+1] site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [15:05:04] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#11976933 (10jcrespo) I tested remote backups, and packages seem to be in a working state, but cumin (a dependency) seem to not be working well or lacking extra setup. No worries,... 
[15:05:33] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [15:05:47] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1049.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [15:05:57] (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296585 [15:06:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1049.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [15:06:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1049.eqiad.wmnet [15:06:14] (03PS1) 10Jcrespo: Revert "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1296611 [15:06:43] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1050.eqiad.wmnet [15:06:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1167: Repooling after Icing wait-for-green timeout [15:07:13] (03PS1) 10Ottomata: mw-content-history-reconcile-enrich-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296612 (https://phabricator.wikimedia.org/T421237) [15:08:10] (03CR) 10Kamila Součková: [C:03+1] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296585 (owner: 10Scott French) [15:08:14] (03CR) 10A-pizzata: [C:03+1] "LGTM, thx a lot!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296612 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:08:55] (03CR) 10Jcrespo: "CCing Moritz for awareness (I commented on the ticket too), although I am sure he is aware." [puppet] - 10https://gerrit.wikimedia.org/r/1296611 (owner: 10Jcrespo) [15:08:59] (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Testing x1 backups on new cumin2003 trixie host" [puppet] - 10https://gerrit.wikimedia.org/r/1296611 (owner: 10Jcrespo) [15:09:08] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296514|[Growth] Set wgGEMentorshipCleanupEnabled to false on all wikis (T427386)]] (duration: 06m 22s) [15:09:12] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [15:09:16] Dreamy_Jazz: i'm done, over to you [15:09:28] Thanks [15:09:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P93580 and previous config saved to /var/cache/conftool/dbconfig/20260602-150951-fceratto.json [15:09:54] Going to retry the spiderpig deploy I was doing in the window including the changes in said window [15:10:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [15:10:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [15:10:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [15:10:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 
(https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [15:10:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [15:10:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296582 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [15:10:25] (03CR) 10Urbanecm: [C:04-1] "Needs the MW patch to be both merged and deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [15:10:48] 06SRE, 06ServiceOps new: Build httpbb for Trixie - https://phabricator.wikimedia.org/T427899#11977012 (10MLechvien-WMF) p:05Triage→03Medium a:03RLazarus [15:11:27] (03PS1) 10JMeybohm: partman/reuse-raid10-6dev.cfg: Apply swap workaround [puppet] - 10https://gerrit.wikimedia.org/r/1296613 (https://phabricator.wikimedia.org/T427088) [15:11:39] (03Merged) 10jenkins-bot: Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [15:11:42] (03Merged) 10jenkins-bot: Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [15:11:44] (03Merged) 10jenkins-bot: Clean up bot password configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296566 (owner: 10Bartosz Dziewoński) [15:11:52] (03Merged) 10jenkins-bot: Use the globalblock-local-status right over globalblock-whitelist [extensions/GlobalBlocking] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296582 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [15:12:01] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1068.eqiad.wmnet with OS trixie 
[15:12:26] jayme@cumin2002 reimage (PID 3592644) is awaiting input [15:12:27] (03CR) 10JMeybohm: [C:03+2] partman/reuse-raid10-6dev.cfg: Apply swap workaround [puppet] - 10https://gerrit.wikimedia.org/r/1296613 (https://phabricator.wikimedia.org/T427088) (owner: 10JMeybohm) [15:12:45] !log jayme@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2006.codfw.wmnet with OS trixie [15:12:47] (03Merged) 10jenkins-bot: Remove workaround for stuck session cookies on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296563 (https://phabricator.wikimedia.org/T389433) (owner: 10Bartosz Dziewoński) [15:12:51] (03Merged) 10jenkins-bot: cswiki: lift IP cap for workshop on 08-June-2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295574 (https://phabricator.wikimedia.org/T427678) (owner: 10Anzx) [15:12:57] (03CR) 10Ottomata: [C:03+2] mw-content-history-reconcile-enrich-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296612 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:13:23] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1295502|Revert "labswiki: Disallow account autocreation"]], [[gerrit:1283106|Remove unused 'writeapi' right]], [[gerrit:1296566|Clean up bot password configuration]], [[gerrit:1296563|Remove workaround for stuck session cookies on Wikitech (T389433)]], [[gerrit:1295574|cswiki: lift IP cap for workshop on 08-June-2026 (T427678)]], [[gerrit:1296582|Us [15:13:23] e the globalblock-local-status right over globalblock-whitelist (T277942)]] [15:13:27] T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433 [15:13:28] T427678: Lift IP cap on 2026-06-08 for Czech Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T427678 [15:13:28] T277942: Address Voice and Tone issues in GlobalBlocking - https://phabricator.wikimedia.org/T277942 [15:14:24] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [15:14:26] 
FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:10] !log dreamyjazz@deploy1003 matmarex, anzx, dreamyjazz: Backport for [[gerrit:1295502|Revert "labswiki: Disallow account autocreation"]], [[gerrit:1283106|Remove unused 'writeapi' right]], [[gerrit:1296566|Clean up bot password configuration]], [[gerrit:1296563|Remove workaround for stuck session cookies on Wikitech (T389433)]], [[gerrit:1295574|cswiki: lift IP cap for workshop on 08-June-2026 (T427678)]], [[gerrit:1296582 [15:15:10] |Use the globalblock-local-status right over globalblock-whitelist (T277942)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:15:19] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [15:15:38] Dreamy_Jazz: nothing to test for throttle patch [15:15:44] (03PS1) 10Urbanecm: feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) [15:15:47] Thanks [15:15:57] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296612 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:15:59] (03PS1) 10Urbanecm: feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296615 (https://phabricator.wikimedia.org/T427386) [15:17:48] (03CR) 10Hashar: [C:03+1] "I have tested it ( T422258#11977056 ), though not with a new instance, but that got the job done. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:17:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:17:52] (03CR) 10Andrew Bogott: [C:03+2] designate: remove leftover mcrouter code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T427189) (owner: 10Andrew Bogott) [15:17:56] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:18:06] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS trixie [15:18:12] (03CR) 10Hashar: [C:03+1] "Verified and Puppet agent still pass with the profile removed ( T422258#11977087 ). Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:18:25] !log dreamyjazz@deploy1003 matmarex, anzx, dreamyjazz: Continuing with deployment [15:18:55] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1070.eqiad.wmnet with OS trixie [15:19:15] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:19:20] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:19:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T426633)', diff saved to https://phabricator.wikimedia.org/P93582 and previous config saved to /var/cache/conftool/dbconfig/20260602-151958-fceratto.json [15:20:14] jiji@cumin1003 decommission (PID 124164) is awaiting input [15:20:19] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [15:20:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling 
db1188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93583 and previous config saved to /var/cache/conftool/dbconfig/20260602-152026-fceratto.json [15:20:34] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [15:20:53] zabe: I'm seeing a lot of error logs relating to a script you are running [15:21:06] "InvalidArgumentException: No server with index '0'" [15:21:32] (03PS7) 10Clément Goubert: api-gateway: Pre-teardown deprecation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294957 (https://phabricator.wikimedia.org/T426881) [15:22:09] e.g. https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2026.06.02?id=a7PriJ4BwaIJ3BXy_YCr [15:22:37] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295502|Revert "labswiki: Disallow account autocreation"]], [[gerrit:1283106|Remove unused 'writeapi' right]], [[gerrit:1296566|Clean up bot password configuration]], [[gerrit:1296563|Remove workaround for stuck session cookies on Wikitech (T389433)]], [[gerrit:1295574|cswiki: lift IP cap for workshop on 08-June-2026 (T427678)]], [[gerrit:1296582|U [15:22:37] se the globalblock-local-status right over globalblock-whitelist (T277942)]] (duration: 09m 14s) [15:22:38] (03CR) 10Andrew Bogott: [C:03+2] Add new class, labs_lvm_ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:22:41] (03CR) 10Andrew Bogott: [C:03+2] Remove profile::wmcs::lvm [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:22:43] T389433: Fix stuck old cookies on Wikitech - https://phabricator.wikimedia.org/T389433 [15:22:43] T427678: Lift IP cap on 2026-06-08 for Czech Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T427678 [15:22:44] T277942: Address Voice and Tone issues in 
GlobalBlocking - https://phabricator.wikimedia.org/T277942 [15:22:50] (Jenkins had some issue earlier and that got handled & fixed by j.nuche) [15:22:54] (03PS10) 10Andrew Bogott: Add new class, labs_lvm_ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) [15:23:03] (03PS8) 10Andrew Bogott: Remove profile::wmcs::lvm [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) [15:25:09] jouncebot: nowandnext [15:25:09] For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1500) [15:25:10] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1600) [15:25:21] Dreamy_Jazz: I may deploy something once you’re done [15:25:24] (03CR) 10Andrew Bogott: [C:03+2] Add new class, labs_lvm_ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:25:26] I am done [15:25:27] (03CR) 10Andrew Bogott: [C:03+2] Remove profile::wmcs::lvm [puppet] - 10https://gerrit.wikimedia.org/r/1282007 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [15:25:31] You can go ahead [15:26:04] ok [15:26:23] 10ops-ulsfo, 06SRE, 06DC-Ops: ULSFO: Unrack old switches (asw2-22/23-ulsfo) - https://phabricator.wikimedia.org/T427283#11977212 (10ayounsi) [15:26:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296558 (https://phabricator.wikimedia.org/T421464) (owner: 10Kosta Harlan) [15:26:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296568 (https://phabricator.wikimedia.org/T421464) (owner: 10Kosta Harlan) 
[15:26:29] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2158: Repooling [15:27:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93585 and previous config saved to /var/cache/conftool/dbconfig/20260602-152726-fceratto.json [15:29:56] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:30:01] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762#11977272 (10ayounsi) 05Open→03Resolved a:03ayounsi No need to keep that old parent task open. [15:30:01] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:31:44] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [15:32:11] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:32:16] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [15:32:54] (03CR) 10Dzahn: [C:03+2] Update name and address for bvibber, drop dead blog from planet [puppet] - 10https://gerrit.wikimedia.org/r/1296038 (owner: 10Bvibber) [15:33:16] (03CR) 10Dzahn: [C:03+2] "checked with Brooke" [puppet] - 10https://gerrit.wikimedia.org/r/1296038 (owner: 10Bvibber) [15:33:26] 06SRE, 06Infrastructure-Foundations, 10netops: Occasional high ICMP probe response from codfw to cr1-drmrs - https://phabricator.wikimedia.org/T315645#11977334 (10cmooney) 05Stalled→03Declined [15:33:43] 06SRE, 06Data-Engineering: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865#11977336 (10ayounsi) [15:34:56] Dreamy_Jazz: Thanks for the head up! 
Restarted it, which seems to have fixed it. [15:35:03] Thanks! [15:35:18] (03CR) 10Dzahn: [C:03+2] admin: upgrade Mahmoud Abdelsattar from ldap_only to shell user [puppet] - 10https://gerrit.wikimedia.org/r/1295952 (https://phabricator.wikimedia.org/T427597) (owner: 10Dzahn) [15:35:31] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [15:36:10] (03PS1) 10Ottomata: mw-content-history-reconcile-enrich - allow fetching schemas from schema service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296619 (https://phabricator.wikimedia.org/T421237) [15:36:34] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1069.eqiad.wmnet with OS trixie [15:37:34] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1071.eqiad.wmnet with OS trixie [15:37:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P93586 and previous config saved to /var/cache/conftool/dbconfig/20260602-153734-fceratto.json [15:38:54] (03Merged) 10jenkins-bot: hCaptcha: Remove apiUrl health check and APCu layer from health checker [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296558 (https://phabricator.wikimedia.org/T421464) (owner: 10Kosta Harlan) [15:38:57] (03Merged) 10jenkins-bot: hCaptcha: Remove apiUrl health check and APCu layer from health checker [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296568 (https://phabricator.wikimedia.org/T421464) (owner: 10Kosta Harlan) [15:40:00] (03CR) 10Ottomata: [C:03+2] mw-content-history-reconcile-enrich - allow fetching schemas from schema service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296619 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:40:17] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1296558|hCaptcha: 
Remove apiUrl health check and APCu layer from health checker (T421464)]], [[gerrit:1296568|hCaptcha: Remove apiUrl health check and APCu layer from health checker (T421464)]] [15:40:21] T421464: hCaptcha: Stop using urldownloader for health checks of the secure-api.js file - https://phabricator.wikimedia.org/T421464 [15:40:41] (03CR) 10A-pizzata: [C:03+1] mw-content-history-reconcile-enrich - allow fetching schemas from schema service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296619 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:42:01] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1296558|hCaptcha: Remove apiUrl health check and APCu layer from health checker (T421464)]], [[gerrit:1296568|hCaptcha: Remove apiUrl health check and APCu layer from health checker (T421464)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:42:06] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich - allow fetching schemas from schema service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296619 (https://phabricator.wikimedia.org/T421237) (owner: 10Ottomata) [15:43:31] !log kharlan@deploy1003 kharlan: Continuing with deployment [15:47:41] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296558|hCaptcha: Remove apiUrl health check and APCu layer from health checker (T421464)]], [[gerrit:1296568|hCaptcha: Remove apiUrl health check and APCu layer from health checker (T421464)]] (duration: 07m 24s) [15:47:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P93587 and previous config saved to /var/cache/conftool/dbconfig/20260602-154742-fceratto.json [15:47:45] T421464: hCaptcha: Stop using urldownloader for health checks of the secure-api.js file - https://phabricator.wikimedia.org/T421464 [15:48:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [15:49:35] (03Merged) 10jenkins-bot: hCaptcha: Load self-hosted secure-api.js on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295909 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [15:49:51] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1295909|hCaptcha: Load self-hosted secure-api.js on group0 wikis (T403829)]] [15:49:55] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [15:50:17] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [15:51:39] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1295909|hCaptcha: Load self-hosted secure-api.js on group0 wikis (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:51:52] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1070.eqiad.wmnet with OS trixie [15:53:03] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host mc1072.eqiad.wmnet with OS trixie [15:53:04] (03PS1) 10Dreamy Jazz: core-Permissions: Stop assigning unused globalblock-whitelist right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296620 (https://phabricator.wikimedia.org/T277942) [15:53:36] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007#11977548 (10ayounsi) 05Open→03Resolved a:03ayounsi Looks like the current provisioning process with the `Port with no description on access switch` ale... 
[15:54:01] (03PS1) 10Ottomata: mw_page_html_content_change_enrich_next - remove temporary kafka cluster override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296623 (https://phabricator.wikimedia.org/T423920) [15:54:06] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [15:56:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11977586 (10Dzahn) Hi @mahmoud.abdelsattar.wmde give it max. ~ 30 minutes for the changes to deploy and you should have the access as requested. Ch... [15:56:39] jayme@cumin2002 reimage (PID 3605330) is awaiting input [15:56:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11977590 (10Dzahn) 05In progress→03Resolved a:03Dzahn ` [deploy1003:~] $ id mahmoud-abdelsattar uid=100472(mahmoud-abdelsattar) gid=500(wiki... [15:57:28] (03CR) 10Ottomata: [C:03+2] "this staging, staging is not running. Merging for future uses of staging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296623 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [15:57:35] Still verifying the above config patch [15:57:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93588 and previous config saved to /var/cache/conftool/dbconfig/20260602-155749-fceratto.json [15:58:10] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:58:10] (03CR) 10Komla Sapaty: "This is an attempt to determine the activity levels(using SSH login activity) of Toolforge users. 
It looks at the last successful login re" [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (owner: 10Komla Sapaty) [15:58:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T426633)', diff saved to https://phabricator.wikimedia.org/P93589 and previous config saved to /var/cache/conftool/dbconfig/20260602-155817-fceratto.json [15:58:35] (03CR) 10Dzahn: "thanks for this one" [puppet] - 10https://gerrit.wikimedia.org/r/1296537 (https://phabricator.wikimedia.org/T356296) (owner: 10Majavah) [15:58:48] (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1296608 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:59:04] !log kharlan@deploy1003 kharlan: Rolling back deployment [15:59:27] (03PS1) 10Kosta Harlan: Revert "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296624 (https://phabricator.wikimedia.org/T403829) [15:59:30] (03Merged) 10jenkins-bot: mw_page_html_content_change_enrich_next - remove temporary kafka cluster override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296623 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [15:59:40] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295909|hCaptcha: Load self-hosted secure-api.js on group0 wikis (T403829)]] (duration: 09m 48s) [15:59:45] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [15:59:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296624 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [16:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1600). 
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:25] FIRING: SystemdUnitFailed: prometheus-puppet-ca-exporter.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:08] (03CR) 10Dzahn: [C:03+1] "lgtm, but I think this needs review from infra foundations?" [puppet] - 10https://gerrit.wikimedia.org/r/1296495 (https://phabricator.wikimedia.org/T420184) (owner: 10Arnaudb) [16:01:48] kostajh: i take it you're still deploying? [16:02:01] urbanecm: yes, will be done soon [16:02:06] perf [16:02:06] Rolling back a config patch [16:02:16] can i start CI on a MW backport? or should i wait [16:02:54] (03CR) 10Btullis: [C:03+2] dumps: http: Stop prepending the hostname to the syslog events [puppet] - 10https://gerrit.wikimedia.org/r/1296605 (https://phabricator.wikimedia.org/T425087) (owner: 10Btullis) [16:03:07] (03Merged) 10jenkins-bot: Revert "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296624 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [16:03:20] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1296624|Revert "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] [16:04:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295968 (https://phabricator.wikimedia.org/T426799) (owner: 10Marco Fossati) [16:04:38] urbanecm: I think you can start +2’ing things [16:04:42] (03CR) 10Dr0ptp4kt: Add filerevision to the mediawiki not-history sqoop (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) 
[16:04:45] (03CR) 10Urbanecm: [C:03+2] feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296615 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:04:47] (03CR) 10Urbanecm: [C:03+2] feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:05:09] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1296624|Revert "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:05:13] T403829: hCaptcha: Self-host secure-api.js code in /static directory - https://phabricator.wikimedia.org/T403829 [16:05:24] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [16:05:25] RESOLVED: SystemdUnitFailed: prometheus-puppet-ca-exporter.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T426633)', diff saved to https://phabricator.wikimedia.org/P93590 and previous config saved to /var/cache/conftool/dbconfig/20260602-160527-fceratto.json [16:05:43] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations, 13Patch-For-Review: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11977655 (10Dzahn) Ah, it's in the puppetserver module. Thanks! The pat... 
[16:05:47] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [16:05:49] !log kharlan@deploy1003 kharlan: Continuing with deployment [16:07:32] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296608 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:08:15] 06SRE, 06Infrastructure-Foundations, 10netops: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364#11977663 (10ayounsi) 05Open→03Resolved I've renamed it to `Interface UP for 7 days with no description` when migrating the alert to AlertManager. Please... [16:08:47] (03PS1) 10Ahmon Dancy: Bump buildkit to v0.30.0 [puppet] - 10https://gerrit.wikimedia.org/r/1296626 (https://phabricator.wikimedia.org/T426212) [16:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:54] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage [16:10:01] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296624|Revert "hCaptcha: Load self-hosted secure-api.js on group0 wikis" (T403829)]] (duration: 06m 40s) [16:10:04] urbanecm: over to you [16:10:50] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1071.eqiad.wmnet with OS trixie [16:11:05] (03CR) 10Dr0ptp4kt: Add filerevision to the mediawiki not-history sqoop (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [16:12:35] 06SRE, 06Infrastructure-Foundations: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#11977704 (10CWilliams-WMF) > 
AttributeError: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning' This looks to be an outdated PuppetDB config attempting to disable a warning that was... [16:13:00] (03PS1) 10Matthias Mullie: Add missing lazy img to carousel [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296627 (https://phabricator.wikimedia.org/T427821) [16:13:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296627 (https://phabricator.wikimedia.org/T427821) (owner: 10Matthias Mullie) [16:14:08] (03PS3) 10Kamila Součková: admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) [16:14:17] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [16:14:55] (03CR) 10CI reject: [V:04-1] admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [16:15:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P93591 and previous config saved to /var/cache/conftool/dbconfig/20260602-161534-fceratto.json [16:16:48] (03CR) 10Atsuko: [C:03+2] services_proxy: switch to prod opensearch-on-k8s services [puppet] - 10https://gerrit.wikimedia.org/r/1296608 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:17:11] (03PS1) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296628 (https://phabricator.wikimedia.org/T425624) [16:17:52] (03Merged) 10jenkins-bot: feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.5) - 
10https://gerrit.wikimedia.org/r/1296615 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:17:55] (03CR) 10CI reject: [V:04-1] feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:18:01] (03PS1) 10Atsuko: translate: adding separate read/write endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296631 (https://phabricator.wikimedia.org/T425377) [16:18:12] (03CR) 10Urbanecm: feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:18:27] (03CR) 10Urbanecm: [C:03+2] feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:18:52] (03PS1) 10Matthias Mullie: Image Browsing: add accessible labels to carousel elements [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) [16:19:26] (03CR) 10Kosta Harlan: [C:04-2] hCaptcha: Roll out self-hosted secure-api.js to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295910 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [16:19:49] (03PS1) 10Kosta Harlan: Revert^2 "hCaptcha: Load self-hosted secure-api.js on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) [16:19:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:19:57] FIRING: [8x] ProbeDown: Service mw-web:4450 has failed probes 
(http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:15] !ack [16:20:16] 8045 (ACKED) [8x] ProbeDown sre (probes/service) [16:20:21] (03CR) 10Kosta Harlan: [C:04-2] "Needs https://gitlab.wikimedia.org/repos/product-safety-and-integrity/hcaptcha-secure-api-vendor/-/merge_requests/3 to be merged and synce" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296635 (https://phabricator.wikimedia.org/T403829) (owner: 10Kosta Harlan) [16:20:42] o/ [16:20:57] FIRING: [3x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:13] urbanecm: what's train status? [16:21:13] looks pretty noisy, is it real? https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=%24__all&orgId=1&from=now-15m&to=now&timezone=utc&var-site=%24__all&var-Filters [16:21:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.74% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:21:19] federico3: yes [16:21:31] FIRING: [2x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown 
[16:21:34] cdanis: anything I can help with? [16:21:45] cdanis: not a train conductor. i only +2'ed some patches, but i stopped scap, so at most they'll merge [16:21:55] according to https://versions.toolforge.org/, we are still on wmf.4 [16:22:22] 10SRE-swift-storage, 06cloud-services-team, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949 (10Andrew) 03NEW [16:22:37] the errors are "503 Service Unavailable" with bunch of services [16:23:13] 10SRE-swift-storage, 06cloud-services-team, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977818 (10Andrew) [16:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.273s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:24:06] 10SRE-swift-storage, 06cloud-services-team, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977825 (10Andrew) [16:24:14] (03PS1) 10Btullis: Add the wdqs::alternative nodes to the S3/Ceph envoy firewall [puppet] - 10https://gerrit.wikimedia.org/r/1296636 (https://phabricator.wikimedia.org/T427319) [16:24:17] (03CR) 10JavierMonton: [C:03+2] stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296628 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [16:24:45] There are over 2 million log records in the last 15 minutes from kartotherian [16:24:48] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1296636 (https://phabricator.wikimedia.org/T427319) (owner: 10Btullis) [16:24:57] RESOLVED: [8x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:37] Ah, not all kartotherian. Many different services [16:25:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P93593 and previous config saved to /var/cache/conftool/dbconfig/20260602-162542-fceratto.json [16:25:57] RESOLVED: [8x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:05] !incidents [16:26:06] 8046 (UNACKED) PHPFPMTooBusy sre (mw-web main codfw) [16:26:06] 8045 (RESOLVED) [8x] ProbeDown sre (probes/service) [16:26:06] 8040 (RESOLVED) Host es2050 (paged) [16:26:06] 8039 (RESOLVED) Host db2175 (paged) [16:26:07] 8042 (RESOLVED) Host db2157 (paged) [16:26:07] 8043 (RESOLVED) Host db2153 (paged) [16:26:07] 8041 (RESOLVED) Host db2154 (paged) [16:26:08] 8044 (RESOLVED) Host db2176 (paged) [16:26:08] 8038 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [16:26:11] !ack [16:26:12] 8046 (ACKED) PHPFPMTooBusy sre (mw-web main codfw) [16:26:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.74% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:26:31] FIRING: [5x] 
RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [16:26:51] 10SRE-swift-storage, 06cloud-services-team, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977884 (10MatthewVernon) TIFF compression is fairly easy via [[ https://manpages.debian.org/trixie/libtiff-tools/tiffcp.1.en.html | tiffcp ]] (I'm not a compression specialist, b... [16:26:52] (03Merged) 10jenkins-bot: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296628 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [16:27:37] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS trixie [16:27:51] 10SRE-swift-storage, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977905 (10Andrew) [16:28:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 0% idle #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:28:15] 10SRE-swift-storage, 06Commons: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977910 (10MatthewVernon) TIFF compression can be done losslessly, so I see no reason to accept uncompressed TIFFs. 
[16:28:21] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 874.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:29:14] 06SRE, 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11977917 (10MatthewVernon) [16:29:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:30:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:30:12] FIRING: [8x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:30:16] Dreamy_Jazz: i just noticed that you shipped my config patches earlier, thanks! 
[16:30:24] !ack [16:30:25] 8047 (ACKED) PHPFPMTooBusy sre (mw-web main codfw) [16:30:25] 8048 (ACKED) [6x] ProbeDown sre (probes/service) [16:30:27] FIRING: [8x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:51] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1072.eqiad.wmnet with OS trixie [16:31:10] (03CR) 10JMeybohm: [C:03+1] CI: Fix CI pass on template render fail [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: 10Kamila Součková) [16:31:52] (03CR) 10Dzahn: [C:03+2] Bump buildkit to v0.30.0 [puppet] - 10https://gerrit.wikimedia.org/r/1296626 (https://phabricator.wikimedia.org/T426212) (owner: 10Ahmon Dancy) [16:32:14] No problem, I thought I might as well ship them with mine [16:32:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296631 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [16:33:13] (03Merged) 10jenkins-bot: feat(cleanMentorList): Add a feature flag [extensions/GrowthExperiments] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1296614 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:33:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:33:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 0% idle #page - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:33:40] (03CR) 10Dzahn: "No, I meant to say it should change nothing. The same procedure as before, just different code to get the same result to be able to rsync " [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [16:33:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:34:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:35:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:35:12] RESOLVED: [8x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:13] (03PS1) 10TrainBranchBot: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296645 (https://phabricator.wikimedia.org/T423914) [16:35:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 
1.028s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:35:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296645 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [16:35:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T426633)', diff saved to https://phabricator.wikimedia.org/P93594 and previous config saved to /var/cache/conftool/dbconfig/20260602-163550-fceratto.json [16:35:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:58] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:36:04] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [16:36:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [16:36:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T426633)', diff saved to https://phabricator.wikimedia.org/P93595 and previous config saved to /var/cache/conftool/dbconfig/20260602-163622-fceratto.json [16:36:42] (03CR) 10Dr0ptp4kt: "Except maybe the logging piece. 
It would then inherit 64 mappers (instead of the 10 suggested in @aqu comment - which I'm gathering was ju" [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [16:41:17] (03Merged) 10jenkins-bot: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296645 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [16:43:15] (03PS1) 10TrainBranchBot: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296646 (https://phabricator.wikimedia.org/T423914) [16:43:18] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296646 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [16:43:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T426633)', diff saved to https://phabricator.wikimedia.org/P93596 and previous config saved to /var/cache/conftool/dbconfig/20260602-164328-fceratto.json [16:46:09] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11978047 (10jcrespo) [16:46:56] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11978050 (10Andrew) [16:47:31] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1050.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:48:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1050.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:48:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [16:48:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1050.eqiad.wmnet [16:49:20] (03CR) 10Dzahn: trafficserver: add a map for gitlab as a backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [16:49:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1296632 (https://phabricator.wikimedia.org/T407793) (owner: 10Matthias Mullie) [16:49:59] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1051.eqiad.wmnet [16:53:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P93597 and previous config saved to /var/cache/conftool/dbconfig/20260602-165336-fceratto.json [16:53:39] (03PS1) 10Dbrant: hCaptcha: Roll out to all except enwiki for mobile apps. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) [17:00:05] swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1700) [17:01:29] o/ [17:03:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P93598 and previous config saved to /var/cache/conftool/dbconfig/20260602-170344-fceratto.json [17:03:53] dancy: it looks like you might be re-running presync? 
no worries if that's the case, as I should be able to work around it [17:04:26] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [17:05:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1071:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1071 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:05:54] swfrench-wmf: I'm blocked by CI issues at the moment so it's all yours. I'll try again later [17:08:20] dancy: hmmm ... it looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1296645 merged, so a sync-world now would pick that up, right? [17:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:10:26] jiji@cumin1003 decommission (PID 203985) is awaiting input [17:10:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1071:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1071 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:13:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T426633)', diff saved to https://phabricator.wikimedia.org/P93599 and previous config saved to /var/cache/conftool/dbconfig/20260602-171354-fceratto.json [17:14:00] (03CR) 10Kamila Součková: [C:03+2] CI: Fix CI pass on template render fail [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: 10Kamila Součková) [17:14:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with 
reason: Maintenance [17:14:17] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296585 (owner: 10Scott French) [17:14:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T426633)', diff saved to https://phabricator.wikimedia.org/P93600 and previous config saved to /var/cache/conftool/dbconfig/20260602-171422-fceratto.json [17:17:28] (03CR) 10JHathaway: [C:03+1] sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: 10Muehlenhoff) [17:21:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T426633)', diff saved to https://phabricator.wikimedia.org/P93601 and previous config saved to /var/cache/conftool/dbconfig/20260602-172135-fceratto.json [17:26:16] (03CR) 10Kosta Harlan: [C:03+1] hCaptcha: Roll out to all except enwiki for mobile apps. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) (owner: 10Dbrant) [17:31:23] swfrench-wmf: dancy: note that i probably have undeployed code merged, as it merged during the incident [17:31:33] happy to finalise the deployment, but i see we're now in MW infra [17:31:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P93602 and previous config saved to /var/cache/conftool/dbconfig/20260602-173143-fceratto.json [17:32:43] urbanecm: thanks! yeah, I've not touched MediaWiki, as I don't know the state of /srv/mediawiki-staging. which is to say, if you'd like to pick up your deployment, please go ahead (I'm trying to understand a CI issue blocking other work I have planned). [17:32:58] okay, let me finish it then [17:33:18] unless, dancy has any concerns that is [17:33:37] swfrench-wmf: sorry for the late reply. 
Yes, that wikiversions change will be picked up. That works for me if it works for you [17:34:17] ah, cool - no objections on my end. FYI, urbanecm ^ it looks like testwikis will also get .5 in the same deployment. [17:34:39] yeah, just saw that in scap [17:34:43] so, that's expected i guess? [17:34:54] yes, expected it seems [17:35:38] That does mean it will be a long deployment. If that's a pain, I can revert the wiki etsi [17:35:50] *the wikiversions change [17:36:24] fine with me [17:36:31] Great [17:37:01] kostajh: scap says you have some undeployed backports (hCaptcha: Remove apiUrl health check and APCu layer from health checker) [17:37:04] is that expected? [17:38:35] or Dreamy_Jazz [17:38:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) (owner: 10Dbrant) [17:40:19] (03Merged) 10jenkins-bot: CI: Fix CI pass on template render fail [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295947 (https://phabricator.wikimedia.org/T427307) (owner: 10Kamila Součková) [17:40:22] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296585 (owner: 10Scott French) [17:41:50] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1296615|feat(cleanMentorList): Add a feature flag (T427386)]], [[gerrit:1296614|feat(cleanMentorList): Add a feature flag (T427386)]] [17:41:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P93603 and previous config saved to /var/cache/conftool/dbconfig/20260602-174150-fceratto.json [17:41:54] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [17:42:25] !log swfrench@deploy1003 
helmfile [staging] START helmfile.d/services/shellbox: apply [17:42:41] (03PS1) 10SBassett: varnish: Add CSP report-only directives for all of upload.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) [17:42:51] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:42:52] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:43:05] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:43:06] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:43:20] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:43:21] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:43:36] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:43:37] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:43:54] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:43:56] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:44:18] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:44:50] Hi folks, I would need to run a query on x1.wikishared to recover accidentally deleted data. It's a trivial query affecting ~60 records, which I pasted in https://phabricator.wikimedia.org/T427962#11978299. May I go ahead and run that? [17:45:25] Daimona: it might be helpful to have a +1 on the query beforehand, just in case (TM) [17:46:12] Does a self +1 count? :D (I tried it locally but you aren't wrong) [17:47:04] otherwise, as long as you wrap it in a transaction (just in case more records are affected) and it's not a regular thing, makes sense. 
but i'd recommend the review anyway :)) [17:47:19] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:48:02] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:48:33] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:49:10] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:49:41] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:49:59] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:50:18] I just double-checked, it's 61 records affected. The review should be coming soon :) [17:50:31] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:50:55] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:51:27] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:51:51] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:51:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T426633)', diff saved to https://phabricator.wikimedia.org/P93604 and previous config saved to /var/cache/conftool/dbconfig/20260602-175157-fceratto.json [17:52:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance [17:52:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:52:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T426633)', diff saved to https://phabricator.wikimedia.org/P93605 and previous config saved to /var/cache/conftool/dbconfig/20260602-175227-fceratto.json [17:53:23] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: 
apply [17:53:32] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:55:53] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1051.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [17:56:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1051.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [17:56:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1051.eqiad.wmnet [17:57:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:58:22] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:59:32] jiji@cumin1003 decommission (PID 263264) is awaiting input [17:59:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T426633)', diff saved to https://phabricator.wikimedia.org/P93607 and previous config saved to /var/cache/conftool/dbconfig/20260602-175933-fceratto.json [18:00:05] dancy and jnuche: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T1800). 
[18:00:32] (deployment finishing) [18:00:51] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1052.eqiad.wmnet [18:01:31] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1296615|feat(cleanMentorList): Add a feature flag (T427386)]], [[gerrit:1296614|feat(cleanMentorList): Add a feature flag (T427386)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:01:34] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [18:01:50] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [18:01:50] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [18:02:23] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [18:02:38] !log reverting shellbox to 2026-05-20-192555 due to errors in shellbox-syntaxhighlight [18:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:54] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:04:29] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:05:00] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [18:05:15] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [18:05:46] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:05:49] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:06:20] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [18:06:51] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [18:07:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:08:37] !log swfrench@deploy1003 
helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:09:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P93608 and previous config saved to /var/cache/conftool/dbconfig/20260602-180941-fceratto.json [18:10:48] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [18:12:16] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [18:12:32] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [18:12:34] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [18:13:00] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [18:13:01] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:13:10] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [18:13:11] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:13:20] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:13:21] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [18:13:30] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [18:13:31] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [18:13:39] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [18:13:56] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:05] Alright I got a +1 :) May I run the query from 
https://phabricator.wikimedia.org/T427962#11978299 in x1.wikishared now? [18:15:41] (03PS1) 10Scott French: shellbox: Revert to 2026-05-20-192555 images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296659 [18:16:00] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296615|feat(cleanMentorList): Add a feature flag (T427386)]], [[gerrit:1296614|feat(cleanMentorList): Add a feature flag (T427386)]] (duration: 34m 09s) [18:16:04] T427386: Deploy automated mentor list cleanup to Wikimedia wikis - https://phabricator.wikimedia.org/T427386 [18:16:49] jiji@cumin1003 decommission (PID 263264) is awaiting input [18:18:23] (03Abandoned) 10Ahmon Dancy: testwikis to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296646 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:18:48] Daimona: 👍 from me [18:18:54] Also, done with deployment [18:19:08] urbanecm: Thanks! [18:19:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P93609 and previous config saved to /var/cache/conftool/dbconfig/20260602-181949-fceratto.json [18:20:51] Going ahead then, ty :) [18:21:39] !log Running query from T427962#11978299 in x1.wikishared [18:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:44] T427962: Accidentally removed (unregistered) all 61 participants from an event on Meta-Wiki: is the data recoverable? 
- https://phabricator.wikimedia.org/T427962 [18:24:59] !log Train is blocked at testwikis on https://phabricator.wikimedia.org/T427935 [18:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:23] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1052.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [18:25:45] (03CR) 10Scott French: [C:03+2] shellbox: Revert to 2026-05-20-192555 images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296659 (owner: 10Scott French) [18:26:03] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1052.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [18:26:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:04] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1052.eqiad.wmnet [18:26:13] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.9 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296660 (https://phabricator.wikimedia.org/T427543) [18:27:49] !log gerrit delete unused plugin projects: barricade, WikimediaBlocks and WikimediaWebSessions [18:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] (03Merged) 10jenkins-bot: shellbox: Revert to 2026-05-20-192555 images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296659 (owner: 10Scott French) [18:29:07] jiji@cumin1003 decommission (PID 284379) is awaiting input [18:29:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T426633)', diff saved to https://phabricator.wikimedia.org/P93610 and previous config saved to /var/cache/conftool/dbconfig/20260602-182956-fceratto.json [18:30:16] !log 
fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1259.eqiad.wmnet with reason: Maintenance [18:30:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T426633)', diff saved to https://phabricator.wikimedia.org/P93611 and previous config saved to /var/cache/conftool/dbconfig/20260602-183023-fceratto.json [18:32:51] dancy: FYI, you may see me making some changes to shellbox in the background. should not conflict with your work (trying to debug something). [18:33:54] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1053.eqiad.wmnet [18:35:14] swfrench-wmf: thx. Train is currently blocked so nothing happening there right now [18:35:39] ah, got it. best of luck unblocking. [18:37:04] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:37:22] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:37:23] (03CR) 10Komla Sapaty: "I will go ahead and redact the usernames so that they are not logged, either in the CSV file or in the DB" [puppet] - 10https://gerrit.wikimedia.org/r/1294864 (owner: 10Komla Sapaty) [18:37:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T426633)', diff saved to https://phabricator.wikimedia.org/P93612 and previous config saved to /var/cache/conftool/dbconfig/20260602-183749-fceratto.json [18:38:32] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:38:38] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:38:49] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [18:38:56] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:54] jiji@cumin1003 decommission (PID 284379) is awaiting input [18:47:01] (03Abandoned) 10Andrew Bogott: rabbitmq: add haproxy in front of codfw1dev endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1260100 (https://phabricator.wikimedia.org/T420937) (owner: 10Andrew Bogott) [18:47:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P93614 and previous config saved to /var/cache/conftool/dbconfig/20260602-184757-fceratto.json [18:52:00] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296662 (https://phabricator.wikimedia.org/T423914) [18:52:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296662 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:53:03] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296662 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:58:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P93615 and previous config saved to /var/cache/conftool/dbconfig/20260602-185804-fceratto.json [19:01:20] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1295967/8630/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [19:04:26] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:19] !log dancy@deploy1003 
rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.5 refs T423914 [19:05:23] T423914: 1.47.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T423914 [19:06:59] (03PS1) 10Bartosz Dziewoński: 'purge_temporary_accounts' job is owned by PSI, not MWP [puppet] - 10https://gerrit.wikimedia.org/r/1296664 [19:08:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T426633)', diff saved to https://phabricator.wikimedia.org/P93616 and previous config saved to /var/cache/conftool/dbconfig/20260602-190811-fceratto.json [19:08:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "/usr/local/sbin/sync-gerrit* files have been created - but no timers to do anything automatically - as intended" [puppet] - 10https://gerrit.wikimedia.org/r/1295967 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [19:09:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:09:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93617 and previous config saved to /var/cache/conftool/dbconfig/20260602-190907-fceratto.json [19:13:02] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.9 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296660 (https://phabricator.wikimedia.org/T427543) (owner: 10Santiago Faci) [19:15:08] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.9 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296660 (https://phabricator.wikimedia.org/T427543) (owner: 10Santiago Faci) [19:37:32] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1053.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [19:40:37] jiji@cumin1003 decommission (PID 284379) is awaiting input 
[19:48:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1053.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [19:48:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:48:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1053.eqiad.wmnet [19:49:44] (03CR) 10Ryan Kemper: [C:03+2] Add the wdqs::alternative nodes to the S3/Ceph envoy firewall [puppet] - 10https://gerrit.wikimedia.org/r/1296636 (https://phabricator.wikimedia.org/T427319) (owner: 10Btullis) [19:51:28] jiji@cumin1003 decommission (PID 341305) is awaiting input [19:55:04] (03CR) 10Dreamy Jazz: [C:03+1] "PSI like their alerting to go to slack, but I assume this is set up per team so should be handled by this" [puppet] - 10https://gerrit.wikimedia.org/r/1296664 (owner: 10Bartosz Dziewoński) [20:00:02] (03PS4) 10Kamila Součková: admin: add apdube-wmf user [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:31] (indeed) [20:03:30] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295979 (https://phabricator.wikimedia.org/T427553) (owner: 10Kamila Součková) [20:03:45] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mc1054.eqiad.wmnet [20:07:16] (03PS1) 10Effie Mouzeli: site.pp: mc1054 is being decommed [puppet] - 10https://gerrit.wikimedia.org/r/1296672 (https://phabricator.wikimedia.org/T426303) [20:09:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93618 and previous config saved to /var/cache/conftool/dbconfig/20260602-200922-fceratto.json [20:11:25] (03CR) 10JHathaway: [C:03+1] site.pp: mc1054 is being decommed [puppet] - 10https://gerrit.wikimedia.org/r/1296672 (https://phabricator.wikimedia.org/T426303) (owner: 10Effie Mouzeli) [20:12:10] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: mc1054 is being decommed [puppet] - 10https://gerrit.wikimedia.org/r/1296672 (https://phabricator.wikimedia.org/T426303) (owner: 10Effie Mouzeli) [20:18:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:18:58] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: re-rack mc2055 (before Jun 9th) - https://phabricator.wikimedia.org/T427373#11979006 (10jijiki) >>! In T427373#11976649, @Jhancock.wm wrote: > @jijiki i'm ready whenever you are to do the move. should only take about 20-30 mi... 
[20:19:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P93619 and previous config saved to /var/cache/conftool/dbconfig/20260602-201929-fceratto.json [20:20:14] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [20:20:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:21:18] (03CR) 10Ottomata: [C:03+1] kafka event platform logs - Strip the stray $!msg field [puppet] - 10https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [20:21:21] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11979010 (10wiki_willy) a:03VRiley-WMF [20:22:43] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06ServiceOps new, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11979017 (10jijiki) [20:23:13] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06ServiceOps new, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11979019 (10jijiki) @Jclark-ctr over to you folks! 
[20:23:16] ops-eqiad, DC-Ops, decommission-hardware, ServiceOps new, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11979020 (Jclark-ctr) a:Jclark-ctr [20:26:06] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1054.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [20:26:31] FIRING: [5x] RedisReplicaDown: Redis replica down rdb2014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [20:27:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc1054.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [20:27:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:27:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1054.eqiad.wmnet [20:27:39] ops-eqiad, DC-Ops, decommission-hardware, ServiceOps new, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11979032 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1003 for hosts: `mc1054.eqiad.wmnet` - mc1054.eqiad.wmnet (**PASS**)...
[20:29:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P93620 and previous config saved to /var/cache/conftool/dbconfig/20260602-202937-fceratto.json [20:39:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T426633)', diff saved to https://phabricator.wikimedia.org/P93621 and previous config saved to /var/cache/conftool/dbconfig/20260602-203945-fceratto.json [20:45:20] (CR) Dreamy Jazz: [C:+1] hCaptcha: Roll out to all except enwiki for mobile apps. [mediawiki-config] - https://gerrit.wikimedia.org/r/1296649 (https://phabricator.wikimedia.org/T426048) (owner: Dbrant) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260602T2100) [21:01:55] (CR) JHathaway: "I'm a little wary of copying over the existing privileges, I would" [software/spicerack] - https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: Elukey) [21:07:30] (CR) BCornwall: [C:+1] wmnet: Update x3-master alias [dns] - https://gerrit.wikimedia.org/r/1296511 (https://phabricator.wikimedia.org/T427895) (owner: Gerrit maintenance bot) [21:08:18] ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11979241 (Jclark-ctr) I was double checking and i was looking at model not serail.. verified again it is actually slot 5 . Disk 5 on Embedded AHCI Controller 2 Available...
[21:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:49] ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11979250 (Jclark-ctr) Removed Failed drive Verified sdb has been removed ` jclark@centrallog1002:~$ cat /proc/mdstat Personalities : [raid10] [linear] [multipath] [raid0]... [21:11:56] ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11979264 (Jclark-ctr) New drive has been Attached @colewhite ready to be rebuilt ` [Tue Jun 2 21:09:44 2026] sd 7:0:0:0: Attached scsi generic sg5 type 0 [Tue Jun 2 21:... [21:13:22] ops-eqiad, SRE, DC-Ops, decommission-hardware, and 2 others: decommission mc10[37-54] - https://phabricator.wikimedia.org/T426303#11979267 (jijiki) [21:18:57] (CR) BCornwall: [C:+1] P:cache:haproxy add image generator information (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: Slyngshede) [21:19:27] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979309 (Ladsgroup) As some data: ` mysql:research@dbstore1007.eqiad.wmnet [commonswiki]> select actor_name, sum(fr_size) from filerevision jo...
[21:23:23] (PS1) Jforrester: Drop the abstractwiki-rust-web images, no longer used [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296681 (https://phabricator.wikimedia.org/T425340) [21:23:26] (PS1) Jforrester: abstractwiki-rust: Bake in semgrep, cargo-chef, clang, and clippy [docker-images/production-images] - https://gerrit.wikimedia.org/r/1296682 (https://phabricator.wikimedia.org/T427989) [21:27:43] (PS1) Eevans: linked-artifacts: update for production deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1296683 (https://phabricator.wikimedia.org/T414140) [21:43:28] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:59:00] ops-eqiad, SRE, DC-Ops, Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11979474 (colewhite) Open→In progress [21:59:13] (PS1) Dzahn: site: add releases[12]004 with collab insetup role [puppet] - https://gerrit.wikimedia.org/r/1296687 (https://phabricator.wikimedia.org/T418299) [22:00:19] (PS1) Dzahn: docker_registry: add next releases hosts (WIP) [puppet] - https://gerrit.wikimedia.org/r/1296688 [22:01:11] (CR) CI reject: [V:-1] docker_registry: add next releases hosts (WIP) [puppet] - https://gerrit.wikimedia.org/r/1296688 (owner: Dzahn) [22:02:21] (CR) Dreamy Jazz: hCaptcha: Enable for badlogin on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: Dreamy Jazz) [22:02:27] jouncebot: nowandnext [22:02:28] No deployments scheduled for the next 7 hour(s) and 57 minute(s) [22:02:28] In 7 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260603T0600)
[22:02:51] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979495 (Ladsgroup) or in other words, total storage of our originals looks like this: https://grafana.wikimedia.org/d/75a174f3-44b6-4416-a8b8-... [22:02:53] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: Dreamy Jazz) [22:04:42] (Merged) jenkins-bot: hCaptcha: Enable for badlogin on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1296551 (https://phabricator.wikimedia.org/T426875) (owner: Dreamy Jazz) [22:05:16] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1296551|hCaptcha: Enable for badlogin on group0 wikis (T426875)]] [22:05:21] T426875: hCaptcha: Support usage in "always challenge" SiteKey for badlogin - https://phabricator.wikimedia.org/T426875 [22:05:44] (PS1) Dreamy Jazz: hCaptcha: Correct inaccurate comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1296689 [22:05:54] (CR) CI reject: [V:-1] hCaptcha: Correct inaccurate comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1296689 (owner: Dreamy Jazz) [22:06:08] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979500 (Ladsgroup) Notified the uploader: https://commons.wikimedia.org/wiki/User_talk:PantheraLeo1359531#Compression_of_TIFF_files [22:06:13] (PS2) Dreamy Jazz: hCaptcha: Correct inaccurate comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1296689 [22:07:15] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1296551|hCaptcha: Enable for badlogin on group0 wikis (T426875)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:09:31] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [22:10:19] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [22:10:39] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [22:13:47] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296551|hCaptcha: Enable for badlogin on group0 wikis (T426875)]] (duration: 08m 31s) [22:13:51] T426875: hCaptcha: Support usage in "always challenge" SiteKey for badlogin - https://phabricator.wikimedia.org/T426875 [22:14:05] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1296689 (owner: Dreamy Jazz) [22:14:28] (PS1) Santiago Faci: Revert "Test Kitchen UI: Deploy v1.3.9 release to production" [deployment-charts] - https://gerrit.wikimedia.org/r/1296690 [22:15:00] (CR) Clare Ming: [C:+2] Revert "Test Kitchen UI: Deploy v1.3.9 release to production" [deployment-charts] - https://gerrit.wikimedia.org/r/1296690 (owner: Santiago Faci) [22:15:05] (Merged) jenkins-bot: hCaptcha: Correct inaccurate comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1296689 (owner: Dreamy Jazz) [22:15:30] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1296689|hCaptcha: Correct inaccurate comment]] [22:17:13] (Merged) jenkins-bot: Revert "Test Kitchen UI: Deploy v1.3.9 release to production" [deployment-charts] - https://gerrit.wikimedia.org/r/1296690 (owner: Santiago Faci) [22:17:29] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1296689|hCaptcha: Correct inaccurate comment]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:17:48] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [22:18:28] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [22:18:46] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [22:21:58] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296689|hCaptcha: Correct inaccurate comment]] (duration: 06m 27s) [22:26:49] (CR) Cwhite: [C:+1] kafka event platform logs - Strip the stray $!msg field [puppet] - https://gerrit.wikimedia.org/r/1296607 (https://phabricator.wikimedia.org/T291645) (owner: Btullis) [22:40:36] (PS2) Arlolra: Deploy PRV to 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) [23:04:41] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:01] (CR) Scott French: "Alas, I did not get a chance to deploy this today, and will aim for tomorrow instead." [puppet] - https://gerrit.wikimedia.org/r/1296036 (https://phabricator.wikimedia.org/T418200) (owner: Scott French) [23:39:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296697 [23:39:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296697 (owner: TrainBranchBot) [23:40:28] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979680 (Ladsgroup) I wrote a script to proactively compress tiff files, and it works pretty nice so far: ` Processing: File:LVGL-SL - DOP20IR... [23:45:42] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979686 (Ladsgroup) Example: https://commons.wikimedia.org/wiki/File:LVGL-SL_-_DOP20IR_-_346000_5490000_(2025).tif [23:47:51] SRE, SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11979687 (Ladsgroup) And it can't even upload the new files: ` ERROR: An error occurred for uri https://commons.wikimedia.org/w/api.php ERROR: T... [23:51:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1296697 (owner: TrainBranchBot)