[00:15:14] ok if i deploy something real quick? low risk, no-op [00:21:47] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:21:54] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [00:22:31] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:23:03] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [00:29:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp5032.eqsin.wmnet [00:31:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie [00:31:11] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS... 
[00:32:36] (03PS1) 10Aleksandar Mastilovic: dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295100 (https://phabricator.wikimedia.org/T423573) [00:34:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:07:48] (03PS1) 10Papaul: Add interfaces ae1.512 and 522 to dhcp relay [homer/public] - 10https://gerrit.wikimedia.org/r/1295101 (https://phabricator.wikimedia.org/T427393) [01:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:09:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295102 [01:09:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295102 (owner: 10TrainBranchBot) [01:11:27] (03PS3) 10Santiago Faci: Remove `wgTestKitchenExperimentStreamNames` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) [01:12:02] (03CR) 10Papaul: [C:03+2] Add interfaces ae1.512 and 522 to dhcp relay [homer/public] - 10https://gerrit.wikimedia.org/r/1295101 (https://phabricator.wikimedia.org/T427393) (owner: 10Papaul) [01:14:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [01:18:04] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1295102 (owner: 10TrainBranchBot) [01:18:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [01:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:47:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS trixie [01:47:39] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965987 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp5032.eqsin.wmnet with OS trix... [01:52:01] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965992 (10Papaul) @BCornwall re-image done on cp5032. The node is now on the new private1-604-eqsin vlan. The DHCP issue I was having,... [01:53:25] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965995 (10BCornwall) Sweet, thanks! I'll re-pool tomorrow. [01:54:44] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11965996 (10Papaul) [02:20:10] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11966000 (10Papaul) Please see below steps before re-imaging a node into the new vlan - Netbox 1- Search for the node in Netbox 2- Click... 
[02:55:34] RECOVERY - Host db1224 #page is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [02:56:25] !log vriley@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1224.eqiad.wmnet [02:59:44] vriley@cumin1003 upgrade-firmware (PID 2791340) is awaiting input [03:00:13] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1224.eqiad.wmnet [03:00:25] !log vriley@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1224.eqiad.wmnet [03:01:55] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts db1224.eqiad.wmnet [03:06:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11966002 (10VRiley-WMF) From the available firmware choices, it seems as though it's up to date. I know the BIOS is completely up to date. I was able to log in to the machine and see it was stuck at a certa... 
[03:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [03:44:56] (03CR) 10Clare Ming: [C:03+2] test-kitchen: Update chart to add a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295092 (https://phabricator.wikimedia.org/T421803) (owner: 10Santiago Faci) [03:45:20] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295094 (https://phabricator.wikimedia.org/T421803) (owner: 10Santiago Faci) [03:47:14] (03Merged) 10jenkins-bot: test-kitchen: Update chart to add a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295092 (https://phabricator.wikimedia.org/T421803) (owner: 10Santiago Faci) [03:47:25] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295094 (https://phabricator.wikimedia.org/T421803) (owner: 10Santiago Faci) [04:34:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:13] FIRING: CertAlmostExpired: Certificate for service 
lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260529T0600) [06:06:22] (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] Update VE core submodule to master (9cf5524e7) [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [06:27:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader2006.wikimedia.org [06:27:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:31:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2006.wikimedia.org - jmm@cumin2002" [06:31:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2006.wikimedia.org - jmm@cumin2002" [06:31:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:31:34] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader2006.wikimedia.org on all recursors [06:31:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader2006.wikimedia.org on all recursors [06:32:16] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader2006.wikimedia.org - jmm@cumin2002" [06:32:21] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader2006.wikimedia.org - jmm@cumin2002" [06:33:28] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11966120 (10cmooney) Awesome work @papaul! I think possibly you can just reimage with the --move-vlan flag for the cookbook? We should... [06:34:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host urldownloader2006.wikimedia.org with OS trixie [06:34:23] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host urldownloader2006.wikimedia.org with OS trixie [06:38:19] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11966122 (10MoritzMuehlenhoff) [06:39:38] (03CR) 10Muehlenhoff: [C:03+1] zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [06:41:43] (03PS12) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) [06:42:45] (03CR) 10Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [06:50:07] (03PS4) 10Trueg: dse-k8s-eqiad: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) [06:51:17] (03CR) 10Trueg: dse-k8s-eqiad: Add wdqs namespaces for the new deployment (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [06:52:57] (03PS1) 10Muehlenhoff: package_builder: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295362 (https://phabricator.wikimedia.org/T416707) [06:53:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader2006.wikimedia.org with reason: host reimage [06:59:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader2006.wikimedia.org with reason: host reimage [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260529T0700) [07:01:53] (03CR) 10Awight: "Maybe the last CR+2 was an accident—but please don't merge this today, it's on a backport branch and there are no deployments on Friday." [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:06:00] 07sre-alert-triage, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Alert in need of triage: PuppetFailure (instance an-test-client1002:9100) - https://phabricator.wikimedia.org/T427399#11966150 (10Gehel) [07:06:09] 07sre-alert-triage, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Alert in need of triage: PuppetFailure (instance an-test-client1002:9100) - https://phabricator.wikimedia.org/T427399#11966152 (10Gehel) p:05Triage→03High [07:06:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:11:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:15:51] (03CR) 10Thiemo Kreuz (WMDE): "Yes, that was a mistake and why I removed the +2. As far as I can tell our CI understands this and doesn't merge the patch then." [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:16:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host urldownloader2006.wikimedia.org with OS trixie [07:16:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader2006.wikimedia.org [07:16:22] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host urldownloader2006.wikimedia.org with OS trixie completed: - urldownloader2006 (**PASS**) - Rem... [07:17:17] (03CR) 10Awight: "It would be possible to merge this way. 
Thank you for removing the CR+2" [extensions/VisualEditor] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294826 (https://phabricator.wikimedia.org/T424232) (owner: 10WMDE-Fisch) [07:22:02] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [07:22:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11966188 (10FCeratto-WMF) I'm seeing the following errors in the logs that look a bit suspicious, specifically the `N/A, transition to Non-recoverable ; CPU 2 ;`, could it be a hardware issue? ` May 29 0... [07:24:56] (03PS1) 10Jelto: miscweb: update wmf-navigator images and DATA_DIR config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295366 (https://phabricator.wikimedia.org/T414405) [07:25:19] !incidents [07:25:20] 8024 (ACKED, 40h 50m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [07:25:20] 8025 (ACKED, 40h 50m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [07:25:20] 8026 (ACKED, 40h 50m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [07:25:20] 8030 (RESOLVED) Host db1224 (paged) [07:25:20] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [07:31:31] (03PS1) 10Brouberol: mediawiki-dumps-legacy: enable sync pods to egress to our s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295368 (https://phabricator.wikimedia.org/T426764) [07:34:24] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11966195 (10karapayneWMDE) As the EM for Wikidata, I approve this! 
[07:35:20] (03CR) 10Klausman: [C:03+1] Set Maglev's scheduling for inference-staging and ingress [puppet] - 10https://gerrit.wikimedia.org/r/1294226 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [07:35:42] (03CR) 10Klausman: [C:03+1] role::ml_k8s::staging::worker: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294225 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [07:36:02] (03CR) 10Klausman: [C:03+1] Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [07:36:17] (03CR) 10Klausman: [C:03+1] role::ml_k8s::staging::master: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294223 (https://phabricator.wikimedia.org/T420438) (owner: 10Elukey) [07:43:14] (03PS1) 10Elukey: docker_registry: remove duplicates from registry-homepage-builder.py [puppet] - 10https://gerrit.wikimedia.org/r/1295371 (https://phabricator.wikimedia.org/T420978) [07:52:29] (03CR) 10Jelto: [C:03+2] miscweb: update wmf-navigator images and DATA_DIR config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295366 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [07:54:23] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2212.codfw.wmnet [07:54:23] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2212.codfw.wmnet [07:54:55] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Pooling [07:54:58] (03Merged) 10jenkins-bot: miscweb: update wmf-navigator images and DATA_DIR config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295366 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [07:58:59] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [07:59:29] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - 
https://phabricator.wikimedia.org/T427597 (10mahmoud.abdelsattar.wmde) 03NEW [07:59:40] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [07:59:48] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:00:00] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260529T0700) [08:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260529T0800). [08:00:13] (03CR) 10Arnaudb: [C:03+2] gitlab: use service name for upstream addr [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:00:33] PROBLEM - Host cr2-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:00:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:01:07] !incidents [08:01:07] 8024 (ACKED, 41h 26m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [08:01:08] 8025 (ACKED, 41h 26m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [08:01:08] 8026 (ACKED, 41h 26m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [08:01:08] 8031 (ACKED) Host cr2-drmrs [08:01:08] 8030 (RESOLVED) Host db1224 (paged) [08:01:09] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [08:01:10] FIRING: [6x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:01:33] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11966257 (10karapayneWMDE) As the EM of the Wikidata team, I approve this! [08:01:39] FIRING: [7x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:01:44] PROBLEM - Host cr2-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:01:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:01:56] hmm something is up with drmrs ? [08:02:50] Checking if we have some known maintenance [08:04:48] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:05:00] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:05:14] Oookay [08:05:21] RECOVERY - Host cr2-drmrs is UP: PING OK - Packet loss = 0%, RTA = 88.88 ms [08:05:36] the pag.e resolved and the alerts as well [08:05:42] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:05:57] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2212: Pooling [08:06:05] XioNoX / topranks FYI see alerts above [08:06:06] (03PS1) 10Jcrespo: install_server: Set backup2014 to be fully reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1295388 (https://phabricator.wikimedia.org/T424661) [08:06:10] FIRING: [6x] BFDdown: BFD 
session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:06:39] FIRING: [10x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:06:46] RECOVERY - Host cr2-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.63 ms [08:06:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - asw1-b12-drmrs:et-0/0/50 (Core: cr2-drmrs:et-0/0/2 {#D0103}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:08:26] (03PS1) 10Federico Ceratto: Enable notifications for db2212 [puppet] - 10https://gerrit.wikimedia.org/r/1295389 (https://phabricator.wikimedia.org/T427388) [08:08:53] !incidents [08:08:53] 8024 (ACKED, 41h 34m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [08:08:53] 8025 (ACKED, 41h 34m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [08:08:54] 8026 (ACKED, 41h 34m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [08:08:54] 8031 (RESOLVED) Host cr2-drmrs [08:08:54] 8030 (RESOLVED) Host db1224 (paged) [08:08:54] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [08:09:21] I logged this also to corto bot in _security in case this happens again [08:09:49] Thanks :-) [08:09:58] I'm just digging around a bit [08:10:01] (03PS1) 10Brouberol: service: add the growthbook-api(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295394 (https://phabricator.wikimedia.org/T427570) [08:10:03] (03PS1) 10Brouberol: service: set the growthbook services state to production [puppet] - 10https://gerrit.wikimedia.org/r/1295395 (https://phabricator.wikimedia.org/T427570) [08:10:05] (03PS1) 10Brouberol: service_proxy: 
register growthbook(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295396 (https://phabricator.wikimedia.org/T427570) [08:11:10] RESOLVED: [6x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.129 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:11:39] RESOLVED: [10x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr2-drmrs (185.15.58.140) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:12:15] slyngs: thanks, we're actually both off today afaik [08:12:27] fwiw eqiad<->drmrs link looks ok now, fyi our de-cix connection in codfw is down, but should be ok [08:12:36] (03CR) 10Jcrespo: [C:03+2] install_server: Set backup2014 to be fully reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1295388 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [08:13:00] topranks: Enjoy your day off then :-) [08:16:14] (03CR) 10Clément Goubert: [C:03+1] service: add the growthbook-api(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295394 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:16:29] (03CR) 10Clément Goubert: [C:03+1] service: set the growthbook services state to production [puppet] - 10https://gerrit.wikimedia.org/r/1295395 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:16:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader1005.wikimedia.org [08:16:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:16:59] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [08:17:00] (03CR) 10Clément Goubert: [C:03+1] service_proxy: register growthbook(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295396 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:17:03] FIRING: [2x] ProbeDown: Service titan1002:443 has 
failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:18:57] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [08:20:39] jynus@cumin2002 reimage (PID 2346233) is awaiting input [08:21:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1005.wikimedia.org - jmm@cumin2002" [08:21:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1005.wikimedia.org - jmm@cumin2002" [08:21:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:21:36] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader1005.wikimedia.org on all recursors [08:21:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader1005.wikimedia.org on all recursors [08:22:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:23:49] (03CR) 10Brouberol: [C:03+2] service: add the growthbook-api(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295394 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [08:25:35] (03PS2) 10Federico Ceratto: db2212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295389 (https://phabricator.wikimedia.org/T427388) [08:26:10] (03PS1) 10Federico Ceratto: db1224: disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1295397 (https://phabricator.wikimedia.org/T427535) [08:27:42] (03PS1) 10Muehlenhoff: toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) [08:29:48] (03CR) 10CI reject: [V:04-1] toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [08:30:10] (03CR) 10MVernon: [C:03+1] db2212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295389 (https://phabricator.wikimedia.org/T427388) (owner: 10Federico Ceratto) [08:30:42] (03CR) 10Jcrespo: [C:03+1] db2212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295389 (https://phabricator.wikimedia.org/T427388) (owner: 10Federico Ceratto) [08:31:02] (03CR) 10MVernon: [C:03+1] db1224: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295397 (https://phabricator.wikimedia.org/T427535) (owner: 10Federico Ceratto) [08:31:33] !log jynus@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm [08:32:05] (03CR) 10Federico Ceratto: [C:03+1] db1224: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295397 (https://phabricator.wikimedia.org/T427535) (owner: 10Federico Ceratto) [08:32:18] (03CR) 10Federico Ceratto: [C:03+2] db1224: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295397 (https://phabricator.wikimedia.org/T427535) (owner: 10Federico Ceratto) [08:32:31] (03CR) 10Federico Ceratto: [C:03+2] db2212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295389 (https://phabricator.wikimedia.org/T427388) (owner: 10Federico Ceratto) [08:33:49] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [08:34:40] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:43] (03PS2) 10Muehlenhoff: toolforge::elasticsearch::haproxy: Restrict to cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1295398 (https://phabricator.wikimedia.org/T149804) [08:36:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1294951 (owner: 10Majavah) [08:36:49] !log atsuko@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [08:37:05] !log atsuko@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [08:37:18] !log atsuko@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [08:38:06] (03CR) 10Muehlenhoff: [C:03+2] pontoon:lb: Restrict firewall services to CLOUD_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/1295015 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [08:38:14] !log atsuko@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [08:39:08] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Pooling [08:39:10] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#11966359 (10MoritzMuehlenhoff) [08:39:17] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2212: Pooling [08:40:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285412 (https://phabricator.wikimedia.org/T422358) (owner: 10Santiago Faci) [08:41:25] 06SRE, 06Infrastructure-Foundations, 10netops: 
cr2-drmrs unexpected reboot - https://phabricator.wikimedia.org/T427600 (10cmooney) 03NEW p:05Triage→03Medium [08:42:23] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Pooling [08:42:32] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2212: Pooling [08:46:54] !log gnt-instance modify -B memory=4g,vcpus=1 etherpad1004.eqiad.wmnet - T427588 [08:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:59] T427588: etherpad showing transient connection issues - https://phabricator.wikimedia.org/T427588 [08:47:58] !log jelto@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM etherpad1004.eqiad.wmnet [08:49:01] (03PS1) 10Jcrespo: installserver: Reimage fully backup2014, not backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1295402 (https://phabricator.wikimedia.org/T424661) [08:49:15] (03PS2) 10Jcrespo: installserver: Reimage fully backup2014, not backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1295402 (https://phabricator.wikimedia.org/T424661) [08:49:20] !log atsuko@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [08:49:20] (03CR) 10CI reject: [V:04-1] installserver: Reimage fully backup2014, not backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1295402 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [08:50:05] !log jynus@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup2014.codfw.wmnet with OS bookworm [08:50:14] (03CR) 10Gmodena: dse-k8s-eqiad: Add wdqs namespaces for the new deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [08:50:18] !log atsuko@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [08:51:32] (03CR) 10Jcrespo: [C:03+2] installserver: Reimage fully backup2014, not backup2012 [puppet] - 10https://gerrit.wikimedia.org/r/1295402 
(https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [08:51:49] !log jelto@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM etherpad1004.eqiad.wmnet [08:53:43] (03CR) 10Gmodena: [C:03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [08:54:09] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [08:59:38] !log gnt-instance modify -B memory=4g,vcpus=1 etherpad2002.codfw.wmnet - T427588 [08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:43] T427588: etherpad showing transient connection issues - https://phabricator.wikimedia.org/T427588 [08:59:53] !log jelto@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM etherpad2002.codfw.wmnet [09:00:02] (03PS1) 10Atsuko: Cleanup old values for turnilo and eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) [09:01:04] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11966446 (10APDube-WMF) @Aklapper Thanks for the prompt! I've linked the LDAP account to my Phabricator account. Let me know if anything else is missing in the request. Thanks! 
[09:03:23] (03CR) 10Brouberol: [C:03+2] service: set the growthbook services state to production [puppet] - 10https://gerrit.wikimedia.org/r/1295395 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [09:03:32] (03PS2) 10Brouberol: service: set the growthbook services state to production [puppet] - 10https://gerrit.wikimedia.org/r/1295395 (https://phabricator.wikimedia.org/T427570) [09:03:50] !log jelto@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM etherpad2002.codfw.wmnet [09:07:46] (03CR) 10Tiziano Fogli: [C:03+2] performance.w.o: restrict blackbox check to ip4 [puppet] - 10https://gerrit.wikimedia.org/r/1293091 (https://phabricator.wikimedia.org/T425299) (owner: 10Tiziano Fogli) [09:08:03] (03CR) 10Brouberol: [C:03+2] service: set the growthbook services state to production [puppet] - 10https://gerrit.wikimedia.org/r/1295395 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [09:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:10:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader1005.wikimedia.org - jmm@cumin2002" [09:10:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader1005.wikimedia.org - jmm@cumin2002" [09:10:48] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T427553#11966461 (10Aklapper) [09:11:01] (03CR) 10Brouberol: [C:03+2] service_proxy: register growthbook(-next) services [puppet] - 10https://gerrit.wikimedia.org/r/1295396 (https://phabricator.wikimedia.org/T427570) (owner: 
10Brouberol) [09:12:33] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage [09:13:34] jmm@cumin2002 makevm (PID 2357860) is awaiting input [09:20:40] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage [09:22:04] (03PS1) 10Brouberol: test-kitchen-next: reach out to the growthbook-api-next through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295407 (https://phabricator.wikimedia.org/T427570) [09:33:08] (03PS1) 10Atsuko: service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) [09:33:17] (03PS1) 10Atsuko: service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) [09:33:41] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [09:33:47] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:34:01] (03PS1) 10Jcrespo: Revert "install_server: Set backup2014 to be fully reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 [09:34:23] (03CR) 10CI reject: [V:04-1] Revert "install_server: Set backup2014 to be fully reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 (owner: 10Jcrespo) [09:35:16] (03PS5) 10Trueg: dse-k8s-eqiad: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) [09:35:40] (03PS2) 10Jcrespo: Revert "install_server: Set backup2014 to be fully reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 [09:36:12] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes 
(http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:49] (03PS2) 10Brouberol: test-kitchen-next: reach out to the growthbook-api-next through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295407 (https://phabricator.wikimedia.org/T427570) [09:37:30] (03PS3) 10Jcrespo: Revert "install_server: Set backup2014 to be fully reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 [09:37:33] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 (owner: 10Jcrespo) [09:40:28] (03PS2) 10Atsuko: service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) [09:40:42] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [09:41:12] RESOLVED: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:46] ah, those kubestagemaster alerts are me, let me see if i can silence the next few... 
[09:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:44:14] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2014.codfw.wmnet with OS bookworm [09:45:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:47:52] (03CR) 10Jcrespo: [C:03+2] Revert "install_server: Set backup2014 to be fully reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1295412 (owner: 10Jcrespo) [09:49:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:50:13] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [09:51:21] 14SRE-Sprint-Week-Sustainability-March2023, 10SRE-tools, 06Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677#11966555 (10MLechvien-WMF) [09:55:17] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Pooling [09:55:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11966557 (10ops-monitoring-bot) Starting pool of db2212 by fceratto@cumin1003: Pooling [09:59:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:01:17] (03PS3) 10Atsuko: service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) [10:01:32] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) 
[10:01:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:02:28] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:04:07] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:05:20] (03CR) 10Brouberol: [C:03+1] service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:05:47] (03PS4) 10Atsuko: service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) [10:05:48] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:05:57] (03CR) 10Santiago Faci: [C:03+2] test-kitchen-next: reach out to the growthbook-api-next through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295407 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [10:06:45] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:08:01] (03Merged) 10jenkins-bot: test-kitchen-next: reach out to the growthbook-api-next through the mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295407 (https://phabricator.wikimedia.org/T427570) (owner: 10Brouberol) [10:08:43] (03CR) 10Gmodena: [C:03+1] dse-k8s-eqiad: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [10:11:25] (03PS5) 10Atsuko: service: move eventstreams-internal to 
lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) [10:11:36] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:12:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:18:13] (03CR) 10Atsuko: [C:03+2] service: move eventstreams-internal to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295409 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:24:24] (03PS2) 10Atsuko: service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) [10:24:55] (03CR) 10CI reject: [V:04-1] service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:25:42] (03PS3) 10Atsuko: service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) [10:27:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host urldownloader1005.wikimedia.org with OS trixie [10:27:11] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host urldownloader1005.wikimedia.org with OS trixie [10:27:52] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Switch to deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1295362 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [10:30:03] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:35:48] RECOVERY - Check unit 
status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:37:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader1005.wikimedia.org with reason: host reimage [10:38:10] (03PS4) 10Atsuko: service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) [10:38:24] (03CR) 10Atsuko: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [10:40:46] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2212: Pooling [10:40:52] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11966662 (10ops-monitoring-bot) Completed pooling of db2212 by fceratto@cumin1003: Pooling [10:41:24] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:43:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader1005.wikimedia.org with reason: host reimage [10:45:41] (03PS1) 10Muehlenhoff: sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) [10:48:25] (03CR) 10CI reject: [V:04-1] sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: 10Muehlenhoff) [10:50:38] (03PS2) 10Muehlenhoff: sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) [10:50:54] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with 
upgrade. [10:53:54] jelto@cumin1003 upgrade (PID 2904376) is awaiting input [10:54:13] (03PS1) 10Clément Goubert: swift::proxy: Deploy shadow ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/1295430 (https://phabricator.wikimedia.org/T414440) [10:59:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host urldownloader1005.wikimedia.org with OS trixie [10:59:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader1005.wikimedia.org [10:59:29] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host urldownloader1005.wikimedia.org with OS trixie completed: - urldownloader1005 (**PASS**) - Rem... [11:00:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader1006.wikimedia.org [11:00:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:00:34] !incidents [11:00:35] 8024 (ACKED, 44h 26m old) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [11:00:35] 8025 (ACKED, 44h 26m old) db2189 (paged)/MariaDB Replica IO: s2 (paged) [11:00:35] 8026 (ACKED, 44h 25m old) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [11:00:35] 8031 (RESOLVED) Host cr2-drmrs [11:00:35] 8030 (RESOLVED) Host db1224 (paged) [11:00:35] 8029 (RESOLVED) ProbeDown sre (2620:0:861:ed1a::1 ip6 text:80 probes/service http_text_ip6 eqiad) [11:04:56] (03PS1) 10Muehlenhoff: Mark the wikidough ports as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) [11:06:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1006.wikimedia.org - jmm@cumin2002" [11:06:11] (03PS2) 10Muehlenhoff: Mark the wikidough ports as intentionally open to the world [puppet] - 
10https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) [11:08:40] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:09:08] jmm@cumin2002 makevm (PID 2395982) is awaiting input [11:10:40] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27958 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:12:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader1006.wikimedia.org - jmm@cumin2002" [11:12:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:12:25] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader1006.wikimedia.org on all recursors [11:12:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader1006.wikimedia.org on all recursors [11:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:13:09] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [11:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown 
[11:15:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader1006.wikimedia.org - jmm@cumin2002" [11:15:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM urldownloader1006.wikimedia.org - jmm@cumin2002" [11:18:34] jmm@cumin2002 makevm (PID 2395982) is awaiting input [11:31:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:36:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host urldownloader1006.wikimedia.org with OS trixie [11:36:22] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host urldownloader1006.wikimedia.org with OS trixie [11:41:00] (03Abandoned) 10Muehlenhoff: cloud: wmf-auto-restart: exclude NFS filesystems [puppet] - 10https://gerrit.wikimedia.org/r/920644 (https://phabricator.wikimedia.org/T316544) (owner: 10Arturo Borrero Gonzalez) [11:43:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [11:43:30] (03PS1) 10Jelto: sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) [11:44:25] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11966873 (10FCeratto-WMF) 05In progress→03Resolved [11:44:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1259896 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal 
Mooney) [11:47:03] (03PS1) 10Clément Goubert: ratelimit: Add CACHE_KEY_PREFIX configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295442 (https://phabricator.wikimedia.org/T424051) [11:47:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on urldownloader1006.wikimedia.org with reason: host reimage [11:51:27] (03PS1) 10Jcrespo: dbbackups: Temp. enable ES read-only backup to refresh on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) [11:51:56] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [11:54:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on urldownloader1006.wikimedia.org with reason: host reimage [11:54:59] (03PS2) 10Jcrespo: dbbackups: Temp. enable ES read-only backup to refresh on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) [11:55:02] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [11:55:38] (03PS3) 10Jcrespo: dbbackups: Temp. 
enable ES read-only backup to refresh on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) [11:55:46] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [11:57:52] 10SRE-tools, 06Infrastructure-Foundations, 10Phabricator, 13Patch-For-Review: offboard-user: Replace deprecated (frozen) Phabricator Conduit API calls with their stable equivalents - https://phabricator.wikimedia.org/T420324#11966939 (10Aklapper) [12:09:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host urldownloader1006.wikimedia.org with OS trixie [12:09:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader1006.wikimedia.org [12:09:23] 06SRE, 06Infrastructure-Foundations: Move URL downloaders to trixie - https://phabricator.wikimedia.org/T427282#11966968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host urldownloader1006.wikimedia.org with OS trixie completed: - urldownloader1006 (**PASS**) - Rem... [12:09:51] (03CR) 10Brouberol: [C:03+1] Cleanup old values for turnilo and eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295405 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:09:54] (03CR) 10Jcrespo: [C:03+2] dbbackups: Temp. enable ES read-only backup to refresh on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1295443 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [12:10:00] (03CR) 10Brouberol: [C:03+1] service: move eventstreams-internal to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:10:43] (03PS1) 10Jcrespo: Revert "dbbackups: Temp. 
enable ES read-only backup to refresh on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1295448 [12:20:21] (03PS2) 10Jcrespo: Revert "dbbackups: Temp. enable ES read-only backup to refresh on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1295448 [12:20:41] (03PS3) 10Jcrespo: Revert "dbbackups: Temp. enable ES read-only backup to refresh on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1295448 [12:23:55] (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Temp. enable ES read-only backup to refresh on new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1295448 (owner: 10Jcrespo) [12:26:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1259898 (https://phabricator.wikimedia.org/T420715) (owner: 10Cathal Mooney) [12:29:15] (03PS1) 10Svantje Lilienthal: Disable the creation of synthetic main refs in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295454 (https://phabricator.wikimedia.org/T427484) [12:32:31] (03PS1) 10Muehlenhoff: Apply urldownloader role to urldownloader2005 [puppet] - 10https://gerrit.wikimedia.org/r/1295455 (https://phabricator.wikimedia.org/T427282) [12:33:37] (03CR) 10Jgreen: "I'm not clear why we're expiring these so frequently? Is there some reason I'm missing why annual expiration is not reasonable?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1295021 (owner: 10Elukey) [12:34:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:16] (03PS1) 10Clément Goubert: ratelimit-media: policy and user-class level metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295457 (https://phabricator.wikimedia.org/T424051) [12:39:23] (03CR) 10Arnaudb: [C:03+1] "lgtm, I hope this is the sweet spot" [cookbooks] - 10https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: 10Jelto) [12:41:06] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thank you! Patch applies cleanly locally on latest wmf/stable, apart from some neglectable whitespace error output." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1292091 (owner: 10Pppery) [12:48:14] (03CR) 10CDanis: [C:03+1] docker_registry: remove duplicates from registry-homepage-builder.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1295371 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [12:55:18] (03CR) 10Majavah: [C:03+2] P:sre::nftables_compat_check: Install python3-pypuppetdb [puppet] - 10https://gerrit.wikimedia.org/r/1294951 (owner: 10Majavah) [13:01:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11967132 (10Jclark-ctr) [13:05:16] (03PS3) 10Bking: dse-k8s: Create kubeconfigs for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/1295068 (https://phabricator.wikimedia.org/T425007) [13:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:09:58] (03CR) 10Bking: [C:03+2] OpenSearch: Add required config for bootstrapping [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [13:16:45] (03PS1) 10AOkoth: site: apply production role to phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1295460 (https://phabricator.wikimedia.org/T423727) [13:19:41] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11967201 (10Papaul) @cmooney thank you, yes move-vlan flag cookbook will also work we need to test that. I don't think we have done any i... [13:20:04] (03CR) 10Bking: [C:03+2] dse-k8s: Create kubeconfigs for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/1295068 (https://phabricator.wikimedia.org/T425007) (owner: 10Bking) [13:21:32] (03CR) 10Hashar: [C:03+1] "thx! 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 (owner: 10Dzahn) [13:29:49] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [13:29:54] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1295460/8617/" [puppet] - 10https://gerrit.wikimedia.org/r/1295460 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [13:34:22] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:35:13] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:43:15] jhathaway: we would like to do a Friday deploy for a UBN issue [13:43:35] https://phabricator.wikimedia.org/T427625 [13:44:00] (03PS1) 10Trueg: dse-k8s-codfw: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295465 (https://phabricator.wikimedia.org/T425007) [13:44:08] kostajh: sounds good [13:44:10] jnuche has already given a +1 [13:44:14] jhathaway: thx [13:44:44] (03PS1) 10Kosta Harlan: GlobalPreferencesHandler: Cast auto-reveal expiry to int [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) [13:45:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) (owner: 10Kosta Harlan) [13:49:41] (03PS2) 10Kosta Harlan: GlobalPreferencesHandler: Cast auto-reveal expiry to int [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) [13:51:49] (03CR) 10Dreamy Jazz: [C:03+2] GlobalPreferencesHandler: Cast auto-reveal expiry to int [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) (owner: 10Kosta Harlan) [13:51:51] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) (owner: 10Kosta Harlan) [13:53:15] (03CR) 10MVernon: [C:03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1295430 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [13:53:27] !log imported OpenJDK 21 21.0.11+10-1~deb12u1 to component/jdk21 (backport of latest Java 21 security release for Bookworm) [13:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:30] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11967533 (10atsuko) 05Open→03In progress [13:57:47] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11967534 (10atsuko) 05Open→03In progress [14:06:44] (03Merged) 10jenkins-bot: GlobalPreferencesHandler: Cast auto-reveal expiry to int [extensions/CheckUser] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295466 (https://phabricator.wikimedia.org/T427625) (owner: 10Kosta Harlan) [14:07:45] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1295466|GlobalPreferencesHandler: Cast auto-reveal expiry to int (T427625)]] [14:07:51] T427625: TypeError: MediaWiki\Extension\CheckUser\Logging\TemporaryAccountLogger::logAutoRevealAccessEnabled(): Argument #2 ($expiry) must be of type int, string given, called in /srv/mediawiki/php-1.47.0-wmf.4/extensions/CheckUser/src/ - https://phabricator.wikimedia.org/T427625 [14:09:48] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1295466|GlobalPreferencesHandler: Cast auto-reveal expiry to int (T427625)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:10:35] (03PS1) 10Muehlenhoff: Bitu: Switch to idm-sre-approval@wikimedia.org for notifications [puppet] - 10https://gerrit.wikimedia.org/r/1295467 [14:11:33] !log kharlan@deploy1003 kharlan: Continuing with deployment [14:15:44] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295466|GlobalPreferencesHandler: Cast auto-reveal expiry to int (T427625)]] (duration: 07m 58s) [14:15:49] T427625: TypeError: MediaWiki\Extension\CheckUser\Logging\TemporaryAccountLogger::logAutoRevealAccessEnabled(): Argument #2 ($expiry) must be of type int, string given, called in /srv/mediawiki/php-1.47.0-wmf.4/extensions/CheckUser/src/ - https://phabricator.wikimedia.org/T427625 [14:16:44] jhathaway jnuche: deployed, all done. Thanks! [14:17:39] 🎉 [14:17:46] (03CR) 10Gmodena: [C:03+1] dse-k8s-codfw: Add wdqs namespaces for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295465 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [14:21:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:21:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:21:41] 06SRE, 06Traffic: netconsole being used for cache hosts? - https://phabricator.wikimedia.org/T427646 (10MoritzMuehlenhoff) 03NEW [14:29:31] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11967728 (10atsuko) There is already existing account `kmontalva-wmf`, but with [[ https://phabricator.wikimedia.org/T4212... 
[14:33:15] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11967749 (10atsuko) 05In progress→03Open [14:40:19] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake - https://phabricator.wikimedia.org/T427197#11967784 (10atsuko) 05In progress→03Open There is already existing account `wmf-ldlulisa` created in T421214, with this level of privileges, bu... [14:42:12] !incidents [14:42:13] 8031 (RESOLVED) Host cr2-drmrs [14:42:13] 8030 (RESOLVED) Host db1224 (paged) [14:43:05] federico3: did you get push notifications too? [14:43:51] cdanis: yes and resolved it but I'm not sure why it's not showing up here and also why it was sent [14:44:51] idk they're quite old, I only got the resolution message [14:46:19] cdanis: to be on the safe side I checked the host again and it's ok [14:53:39] cdanis: hey I’m not around. if cr2-drmrs is failing again maybe depool the site [14:54:03] I’m not gonna be able to look until Sunday otherwise [14:54:06] topranks: it's not [14:54:22] we got a https://en.wikipedia.org/wiki/Long_delayed_echo from VictorOps [14:54:31] ok gotcha [14:54:46] yeah no need to panic otherwise. [15:00:07] (03CR) 10Dzahn: "This host will get the database config and since any connection from dbproxy is allowed .. it could write to it in parallel to the current" [puppet] - 10https://gerrit.wikimedia.org/r/1295460 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [15:03:56] (03CR) 10Dzahn: [C:03+1] sre.gitlab.upgrade: increase downtime for backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1295440 (https://phabricator.wikimedia.org/T427614) (owner: 10Jelto) [15:09:23] 06SRE, 06Traffic: netconsole being used for cache hosts? 
- https://phabricator.wikimedia.org/T427646#11967877 (10ssingh) Yeah it's a good question and predates me so I don't have good answers. But, it doesn't seem that it is enabled for upload? The upload role does say `include profile::netconsole::client` but... [15:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:13:14] (03CR) 10Ssingh: [C:03+1] "This requires Pybal restarts, so please coordinate with Traffic for that. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [15:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:15:14] (03CR) 10Atsuko: "Scheduled on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/1295410 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [15:16:09] 06SRE, 06Traffic: netconsole being used for cache hosts? - https://phabricator.wikimedia.org/T427646#11967900 (10ssingh) I meant we set `profile::netconsole::client::ensure: absent` in `hieradata/role/common/cache/upload.yaml` so it should not be enabled there. [15:16:30] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1295431 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [15:20:39] (03PS1) 10Andrew Bogott: magnum: remove refs to helm chart repo [puppet] - 10https://gerrit.wikimedia.org/r/1295472 (https://phabricator.wikimedia.org/T393782) [15:26:00] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295472 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [15:26:38] !log dancy@deploy1003 Installing scap version "4.267.0" for 2 host(s) [15:28:30] !log dancy@deploy1003 Installation of scap version "4.267.0" completed for 2 hosts [15:30:44] (03CR) 10Andrew Bogott: [C:03+2] magnum: remove refs to helm chart repo [puppet] - 10https://gerrit.wikimedia.org/r/1295472 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [15:34:20] (03PS1) 10Andrew Bogott: magnum: enable magnum-cluster-api driver in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1295473 (https://phabricator.wikimedia.org/T393782) [15:36:04] (03PS2) 10Andrew Bogott: magnum: enable magnum-cluster-api driver in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1295473 (https://phabricator.wikimedia.org/T393782) [15:38:38] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295473 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [15:41:03] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:33] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:46:12] (03PS1) 10Andrew Bogott: Add fake k3s config for eqiad1 magnum [labs/private] - 10https://gerrit.wikimedia.org/r/1295474 [15:48:08] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add fake k3s config for eqiad1 magnum [labs/private] - 10https://gerrit.wikimedia.org/r/1295474 (owner: 10Andrew Bogott) [15:48:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295473 
(https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [15:50:42] (03CR) 10Andrew Bogott: [C:03+2] magnum: enable magnum-cluster-api driver in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1295473 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [15:50:45] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.3.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295475 (https://phabricator.wikimedia.org/T427543) [15:51:19] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.3.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295475 (https://phabricator.wikimedia.org/T427543) [15:54:49] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.3.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295475 (https://phabricator.wikimedia.org/T427543) (owner: 10Clare Ming) [15:56:57] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.9 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295475 (https://phabricator.wikimedia.org/T427543) (owner: 10Clare Ming) [15:57:15] (03PS1) 10Andrew Bogott: magnum: create directory to hold /var/lib/magnum/.kube/config [puppet] - 10https://gerrit.wikimedia.org/r/1295477 (https://phabricator.wikimedia.org/T426431) [15:57:47] (03CR) 10CI reject: [V:04-1] magnum: create directory to hold /var/lib/magnum/.kube/config [puppet] - 10https://gerrit.wikimedia.org/r/1295477 (https://phabricator.wikimedia.org/T426431) (owner: 10Andrew Bogott) [15:58:27] (03PS2) 10Andrew Bogott: magnum: create directory to hold /var/lib/magnum/.kube/config [puppet] - 10https://gerrit.wikimedia.org/r/1295477 (https://phabricator.wikimedia.org/T426431) [15:59:45] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1295477 (https://phabricator.wikimedia.org/T426431) (owner: 10Andrew Bogott) [16:02:02] (03CR) 10Andrew Bogott: [C:03+2] magnum: create directory to hold /var/lib/magnum/.kube/config 
[puppet] - 10https://gerrit.wikimedia.org/r/1295477 (https://phabricator.wikimedia.org/T426431) (owner: 10Andrew Bogott) [16:08:01] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11968118 (10ssingh) >>! In T426968#11965021, @BCornwall wrote: > @ssingh What made you suspect mem errors? I see from the previous boot that OOM kept getting invoked on purged but I suspect that from some non-hardware issue... [16:08:55] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:45] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:13:14] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:13:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Investigate hardware RAID usage in codfw LVS hosts - https://phabricator.wikimedia.org/T426912#11968120 (10ssingh) >>! In T426912#11964942, @BCornwall wrote: > @ssingh @BBlack Okay with me switching write-back to write-through slowly through the codfw cluster or sha... 
[16:14:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:44] (03CR) 10Dzahn: "found in full raw compiler output: "mysql_master_port":"3306","mysql_slave":"m3-slave.codfw.wmnet","mysql_slave_port":"3323 so it's using" [puppet] - 10https://gerrit.wikimedia.org/r/1295460 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [16:21:39] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:26:45] (03PS1) 10Jgreen: Switch fundraisingdb-read.wmnet alias to frdb1008 [dns] - 10https://gerrit.wikimedia.org/r/1295481 (https://phabricator.wikimedia.org/T423950) [16:28:24] (03CR) 10Jgreen: [C:03+2] Switch fundraisingdb-read.wmnet alias to frdb1008 [dns] - 10https://gerrit.wikimedia.org/r/1295481 (https://phabricator.wikimedia.org/T423950) (owner: 10Jgreen) [16:28:58] !log jgreen@dns1004 START - running authdns-update [16:30:45] !log jgreen@dns1004 END - running authdns-update [16:30:52] (03PS1) 10Jasmine: k8s: add new stacked control planes wikikube-ctrl100[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/1295483 (https://phabricator.wikimedia.org/T418920) [16:33:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:05] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:33] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:43:32] (03CR) 10Dzahn: [C:03+2] CI: better 
naming; avoid using terms "new" and "legacy" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 (owner: 10Dzahn) [16:51:32] (03CR) 10Dzahn: [C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 (owner: 10Dzahn) [16:57:17] Hey all I need to regrettably do a Friday deploy. I am just waiting on a code review. Any problems with proceeding in the next hour? [16:58:08] Jdlrobson: No objection from releng. Please get approval from someone in SRE too. [16:59:56] (03PS1) 10Jdlrobson: Do not load experiment if not active and no assigned group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295487 [17:00:20] (03CR) 10CDanis: [C:03+1] Do not load experiment if not active and no assigned group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295487 (owner: 10Jdlrobson) [17:00:26] Jdlrobson: +1 on both patch and deploy [17:00:43] Thx cdanis [17:01:27] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for caro - https://phabricator.wikimedia.org/T426995#11968262 (10Dzahn) [17:01:49] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for caro - https://phabricator.wikimedia.org/T426995#11968265 (10Dzahn) @VPuffetMichel Hi, this request needs your manager approval. Cheers, Daniel [17:02:19] !oncall-now [17:02:20] Oncall now for team SRE, rotation 247_policy: [17:02:20] j.hathaway, t.appof [17:03:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for caro - https://phabricator.wikimedia.org/T426995#11968266 (10Dzahn) @medelius Hi, please create an SSH key and put the public part here on the ticket. 
Cheers, Daniel [17:07:01] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11968269 (10Dzahn) [17:08:33] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11968274 (10Dzahn) [17:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:11:21] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968279 (10Dzahn) [17:11:51] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968294 (10Dzahn) [17:14:54] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968320 (10Dzahn) Hi @thcipriani This request requires your approval as group approver for "restricted". 
[17:20:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:20:41] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:21:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:22:23] (03PS2) 10Jdlrobson: Hide experiment if not active and no assigned group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295487 [17:23:18] 07Puppet, 06collaboration-services, 10Gerrit, 06Infrastructure-Foundations: Change puppet-merge git origin to use gerrit.discovery.wmnet instead of gerrit.wikimedia.org - https://phabricator.wikimedia.org/T420184#11968357 (10Dzahn) The string "gerrit" does not appear anywhere under `/etc/puppet-merge` or i... 
[17:23:40] Thanks cdanis dancy - just waiting for a +2 on the master branch [17:23:44] then i'll proceed [17:24:41] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:27:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:27:47] (03PS2) 10Aleksandar Mastilovic: dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295100 (https://phabricator.wikimedia.org/T423573) [17:28:41] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:29:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:30:41] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:30:45] ok gonna begin now cdanis [17:31:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295487 (owner: 10Jdlrobson) [17:31:50] ^ fyi @jhathaway @tappof [17:32:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:32:54] (03Merged) 10jenkins-bot: Hide experiment if not active and no assigned group [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295487 (owner: 10Jdlrobson) [17:33:13] !log 
jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1295487|Hide experiment if not active and no assigned group]] [17:34:57] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1295487|Hide experiment if not active and no assigned group]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:35:56] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [17:40:07] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295487|Hide experiment if not active and no assigned group]] (duration: 06m 54s) [17:40:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:41:39] k all done! Thank you! [17:48:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:56:45] (03PS1) 10Dzahn: codesearch: stop confd spamming syslog [puppet] - 10https://gerrit.wikimedia.org/r/1295492 [18:08:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11968436 (10Jclark-ctr) [18:08:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11968437 (10Jclark-ctr) 05Open→03Resolved [18:10:05] (03PS3) 10Jasmine: kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) [18:19:20] (03PS2) 10Dzahn: codesearch: stop confd spamming syslog [puppet] - 10https://gerrit.wikimedia.org/r/1295492 (https://phabricator.wikimedia.org/T417458) [18:20:00] (03CR) 10Dzahn: [C:03+2] codesearch: stop confd spamming syslog [puppet] - 10https://gerrit.wikimedia.org/r/1295492 (https://phabricator.wikimedia.org/T417458) (owner: 10Dzahn) [18:39:01] 
10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11968519 (10VRiley-WMF) I'll take a deeper look into this. It's okay to reboot, correct? [19:11:43] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968593 (10thcipriani) >>! In T427597#11968319, @Dzahn wrote: > Hi @thcipriani This request requires your approval as group approver for "restricted". Reason for access i... [19:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:18:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:26:31] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11968647 (10JArguello-WMF) Thanks @atsuko ! I'm confirming with the team, will let you know asap. 
[20:09:09] (03CR) 10Brouberol: [C:03+1] dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295100 (https://phabricator.wikimedia.org/T423573) (owner: 10Aleksandar Mastilovic) [20:12:35] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968827 (10Dzahn) [20:13:05] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968828 (10Dzahn) [20:22:03] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968861 (10Dzahn) @mahmoud.abdelsattar.wmde Now that you got all the approvals we just need one more step. We need to verify your SSH key in some way outside of this ticke... [20:29:32] (03PS1) 10Dzahn: gerrit: use stunnel with rsync of lfs data [puppet] - 10https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) [20:32:14] (03PS2) 10Dzahn: gerrit: use stunnel with rsync of lfs data [puppet] - 10https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) [20:33:18] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1295500/8619/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1295500 (https://phabricator.wikimedia.org/T412780) (owner: 10Dzahn) [20:37:57] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11968922 (10Dzahn) Hello @AudreyPenven_WMDE I just sent you an email to verify your request and SSH key outside of this ticket, which is one of the requirements. Please take a look, cheer... 
[20:38:35] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11968924 (10Dzahn) @thcipriani Hi, here is another request for the 'restricted' group. [20:38:57] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11968925 (10Dzahn) [20:39:35] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11968926 (10Dzahn) 05Open→03In progress [20:39:58] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting Access to Analytics Data Lake for Dlulisa-WMF - https://phabricator.wikimedia.org/T427197#11968938 (10Dzahn) 05Open→03In progress [20:40:09] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to Analytics Data Lake for kevmon/kmontalva-wmf - https://phabricator.wikimedia.org/T427279#11968940 (10Dzahn) 05Open→03In progress [20:40:56] 06SRE, 10SRE-Access-Requests: Requesting access to [restricted] for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T427597#11968945 (10Dzahn) 05Open→03In progress [20:58:15] (03CR) 10Aleksandar Mastilovic: [C:03+1] dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295100 (https://phabricator.wikimedia.org/T423573) (owner: 10Aleksandar Mastilovic) [21:09:13] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:23:08] (03PS1) 10Bartosz Dziewoński: Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 [21:23:18] (03CR) 10CI reject: [V:04-1] Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1295502 (owner: 10Bartosz Dziewoński) [21:24:27] (03PS2) 10Bartosz Dziewoński: Revert "labswiki: Disallow account autocreation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295502 [21:33:57] (03PS1) 10Catrope: passwordlessLogin: Don't immediately error out in unsupported browsers [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) [21:34:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1295504 (https://phabricator.wikimedia.org/T427562) (owner: 10Catrope) [21:41:20] !log catrope@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:42:10] !log catrope@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [21:49:50] (03CR) 10Jasmine: "Right, that makes sense! I've replaced the host level override here with the hiera keys at cluster level, and will remove the host level o" [puppet] - 10https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [22:44:11] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11969114 (10VRiley-WMF) @MatthewVernon Is there a time we can schedule some of these Thanos and msbe servers? 
[23:13:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:13:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:37:59] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [23:39:35] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [23:39:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295509 [23:39:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295509 (owner: 10TrainBranchBot) [23:51:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1295509 (owner: 10TrainBranchBot)