[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184614 (owner: TrainBranchBot)
[00:07:15] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[00:07:18] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[00:08:07] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184953
[00:08:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184953 (owner: TrainBranchBot)
[00:26:00] (CR) RLazarus: [C:+1] P:rsyslog::kafka_shipper: configure output lookup for php8.3-fpm [puppet] - https://gerrit.wikimedia.org/r/1184942 (owner: Scott French)
[00:32:53] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1184953 (owner: TrainBranchBot)
[00:46:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:30:55] (PS1) Scott French: tests: provide a user-agent in test_runbook_exists [alerts] - https://gerrit.wikimedia.org/r/1184957
[01:33:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T402925)', diff saved to https://phabricator.wikimedia.org/P82598 and previous config saved to /var/cache/conftool/dbconfig/20250905-013307-ladsgroup.json
[01:33:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:48:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P82599 and previous config saved to /var/cache/conftool/dbconfig/20250905-014815-ladsgroup.json
[01:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:52:56] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[01:53:00] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[01:55:52] !log ryankemper@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 55 hosts with reason: rolling restart cirrus eqiad
[01:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:03:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P82600 and previous config saved to /var/cache/conftool/dbconfig/20250905-020323-ladsgroup.json
[02:07:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:17:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:18:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T402925)', diff saved to https://phabricator.wikimedia.org/P82601 and previous config saved to /var/cache/conftool/dbconfig/20250905-021830-ladsgroup.json
[02:18:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:18:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance
[02:58:57] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:13:57] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:17] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply new opensearch plugins pkg - ryankemper@cumin1002 - T403749
[03:19:21] T403749: Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters - https://phabricator.wikimedia.org/T403749
[04:50:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance
[04:50:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82602 and previous config saved to /var/cache/conftool/dbconfig/20250905-045020-ladsgroup.json
[04:50:24] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:04] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:43:27] (CR) Arnaudb: "thanks for considering it! I'm currently trying to do the same implementation with mtail as recommended on IRC by @cwhite@wikimedia.org. I" [puppet] - https://gerrit.wikimedia.org/r/1184756 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[05:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:54:44] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 139628
[05:55:51] SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#11150852 (ayounsi) a:ayounsi→None
[05:58:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:58:20] ayounsi@cumin1003 peering (PID 1098651) is awaiting input
[05:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0600)
[06:01:43] ayounsi@cumin1003 peering (PID 1098651) is awaiting input
[06:04:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 139628
[06:13:39] (PS1) Slyngshede: P:cache::haproxy allow datacenter information to be disabled [puppet] - https://gerrit.wikimedia.org/r/1184966 (https://phabricator.wikimedia.org/T403616)
[06:16:40] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6858/console" [puppet] - https://gerrit.wikimedia.org/r/1184966 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede)
[06:18:25] (PS1) Muehlenhoff: Add replacement insetup VMS for VMs currently running on esams02 [puppet] - https://gerrit.wikimedia.org/r/1184967 (https://phabricator.wikimedia.org/T402259)
[06:21:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:23:40] (CR) Muehlenhoff: [C:+2] Add replacement insetup VMS for VMs currently running on esams02 [puppet] - https://gerrit.wikimedia.org/r/1184967 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[06:26:07] (PS1) Kosta Harlan: hCaptcha: Update secure enclave API endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/1184968
[06:26:21] jouncebot: nowandnext
[06:26:22] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0600)
[06:26:22] In 0 hour(s) and 33 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0700)
[06:29:01] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184968 (owner: Kosta Harlan)
[06:29:54] (Merged) jenkins-bot: hCaptcha: Update secure enclave API endpoint [mediawiki-config] - https://gerrit.wikimedia.org/r/1184968 (owner: Kosta Harlan)
[06:30:09] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1184968|hCaptcha: Update secure enclave API endpoint]]
[06:34:45] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install2004.wikimedia.org
[06:36:03] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1184968|hCaptcha: Update secure enclave API endpoint]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:38:26] !log kharlan@deploy1003 Sync cancelled.
[06:39:27] (PS1) Kosta Harlan: Revert "hCaptcha: Update secure enclave API endpoint" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184969
[06:39:44] I canceled the sync -- I guess I need to merge the revert change above? if someone can advise, I'd appreciate it.
[06:39:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:43:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[06:44:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[06:44:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:44:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install2004.wikimedia.org
[06:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:44:54] SRE, Infrastructure-Foundations, Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11150895 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install2004.wikimedia.org` - install2004.wikimedia.org (**PASS**) - Do...
[06:45:32] (PS3) Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - https://gerrit.wikimedia.org/r/1184772
[06:46:03] (CR) Ayounsi: "nice!" [puppet] - https://gerrit.wikimedia.org/r/1184791 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:46:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.468 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:46:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3006.wikimedia.org
[06:46:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:46:44] (CR) Ayounsi: "Can you run PCC for this ? But the change seems like a nice step forward !" [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[06:50:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3006.wikimedia.org - jmm@cumin2002"
[06:50:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3006.wikimedia.org - jmm@cumin2002"
[06:50:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:50:26] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh3006.wikimedia.org on all recursors
[06:50:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3006.wikimedia.org on all recursors
[06:51:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3006.wikimedia.org - jmm@cumin2002"
[06:51:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3006.wikimedia.org - jmm@cumin2002"
[06:56:19] jmm@cumin2002 makevm (PID 2613787) is awaiting input
[06:56:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh3006.wikimedia.org with OS bookworm
[06:56:48] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11150902 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh3006.wikimedia.org with OS bookworm
[06:58:57] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:59:50] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11150903 (MoritzMuehlenhoff)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0700)
[07:00:07] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11150904 (MoritzMuehlenhoff)
[07:02:50] (CR) Muehlenhoff: bird: use LINK_LOCAL sets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:06:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:11:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.024 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:13:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2012.codfw.wmnet with OS bookworm
[07:14:05] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11150954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2012.codfw.wmnet with OS bookworm
[07:18:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3006.wikimedia.org with reason: host reimage
[07:23:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3006.wikimedia.org with reason: host reimage
[07:24:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82603 and previous config saved to /var/cache/conftool/dbconfig/20250905-073225-ladsgroup.json
[07:32:30] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[07:33:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage
[07:38:28] (CR) Vgutierrez: [C:+1] P:cache::haproxy allow datacenter information to be disabled [puppet] - https://gerrit.wikimedia.org/r/1184966 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede)
[07:38:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage
[07:39:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.835 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:40:18] (CR) Slyngshede: [V:+1 C:+2] P:cache::haproxy allow datacenter information to be disabled [puppet] - https://gerrit.wikimedia.org/r/1184966 (https://phabricator.wikimedia.org/T403616) (owner: Slyngshede)
[07:41:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:42:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3006.wikimedia.org with OS bookworm
[07:42:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3006.wikimedia.org
[07:42:38] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11151002 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh3006.wikimedia.org with OS bookworm completed: - doh3006...
[07:45:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet
[07:46:18] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11151018 (ops-monitoring-bot) Draining ganeti3007.esams.wmnet of running VMs
[07:46:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet
[07:47:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet
[07:47:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P82604 and previous config saved to /var/cache/conftool/dbconfig/20250905-074733-ladsgroup.json
[07:47:35] (PS2) Filippo Giunchedi: wmcs: port ::instance to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899)
[07:47:35] (PS2) Filippo Giunchedi: bird: use LINK_LOCAL sets [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899)
[07:48:02] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11151026 (ops-monitoring-bot) Draining ganeti3007.esams.wmnet of running VMs
[07:49:22] (CR) Filippo Giunchedi: bird: use LINK_LOCAL sets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:52:11] (CR) Filippo Giunchedi: "About to finish: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184793" [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:52:25] (CR) Fabfur: P:cache:haproxy add fetch_is_datacenter lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:52:28] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 7 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - https://gerrit.wikimedia.org/r/1184793 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:58:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2012.codfw.wmnet with OS bookworm
[07:58:36] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11151072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2012.codfw.wmnet with OS bookworm completed: - maps2012 (**PASS**) - Downt...
[07:59:09] (PS1) Muehlenhoff: Make doh3006 a wikidough node [puppet] - https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259)
[07:59:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow
[07:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:00:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2013.codfw.wmnet with OS bookworm
[08:00:32] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11151081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2013.codfw.wmnet with OS bookworm
[08:02:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P82605 and previous config saved to /var/cache/conftool/dbconfig/20250905-080241-ladsgroup.json
[08:02:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[08:04:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:05:01] !log Restarted CI Jenkins to update plugins
[08:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:20] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11151097 (Krd) Please unbreak again.
[08:06:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[08:08:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet
[08:09:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:38] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti3007.esams.wmnet
[08:11:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[08:11:50] CirrusSearch consumer-search@codfw is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[08:12:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[08:12:08] CirrusSearch consumer-search@codfw is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[08:12:12] looking ^
[08:16:45] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[08:16:53] (PS1) Muehlenhoff: maps/bookworm: Re-enable monitoring [puppet] - https://gerrit.wikimedia.org/r/1185048 (https://phabricator.wikimedia.org/T381565)
[08:16:54] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[08:17:04] (PS2) Muehlenhoff: maps/bookworm: Re-enable monitoring [puppet] - https://gerrit.wikimedia.org/r/1185048 (https://phabricator.wikimedia.org/T381565)
[08:17:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T402925)', diff saved to https://phabricator.wikimedia.org/P82606 and previous config saved to /var/cache/conftool/dbconfig/20250905-081748-ladsgroup.json
[08:17:53] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[08:18:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Maintenance
[08:18:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T402925)', diff saved to https://phabricator.wikimedia.org/P82607 and previous config saved to /var/cache/conftool/dbconfig/20250905-081811-ladsgroup.json
[08:20:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage
[08:22:26] (PS1) Elukey: services: exclude postgres masters from confs in tegola/kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1185049 (https://phabricator.wikimedia.org/T381565)
[08:24:26] (PS1) Muehlenhoff: maps: Remove unused Hiera option [puppet] - https://gerrit.wikimedia.org/r/1185051 (https://phabricator.wikimedia.org/T381565)
[08:24:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage
[08:24:40] (PS2) Elukey: services: exclude postgres masters from confs in tegola/kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1185049 (https://phabricator.wikimedia.org/T381565)
[08:27:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1185051 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:28:47] (CR) Muehlenhoff: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1185049 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[08:29:18] (CR) Elukey: [C:+1] maps: Remove unused Hiera option [puppet] - https://gerrit.wikimedia.org/r/1185051 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:30:23] jmm@cumin2002 upgrade-firmware (PID 2659734) is awaiting input
[08:34:02] (CR) Elukey: [C:+2] sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: Elukey)
[08:36:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet
[08:39:37] (CR) Ayounsi: [C:+1] Make doh3006 a wikidough node [puppet] - https://gerrit.wikimedia.org/r/1185047 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[08:42:02] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11151264 (fnegri)
[08:42:27] SRE, serviceops, cloud-services-team (FY2025/26-Q1): Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#11151284 (fnegri)
[08:44:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2013.codfw.wmnet with OS bookworm
[08:44:25] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11151313 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2013.codfw.wmnet with OS bookworm completed: - maps2013 (**PASS**) - Downt...
[08:47:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2014.codfw.wmnet with OS bookworm
[08:47:38] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11151320 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2014.codfw.wmnet with OS bookworm
[08:48:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet
[08:48:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti3007.esams.wmnet
[08:48:59] !log remove dbg packages & repool ms-fe2016 T360913
[08:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:02] T360913: Swift proxy server misbehaviour (no longer calling `accept`?)
- https://phabricator.wikimedia.org/T360913 [08:49:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:50:36] (03CR) 10Btullis: mediawiki-dumps-legacy: Use in-pod mcrouter container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [08:51:29] 10SRE-SLO: Pyrra calculations for the Initial error budget value of calendar windows - https://phabricator.wikimedia.org/T403729#11151329 (10elukey) Current doubts: 1) We could lower down the window parameter set in Pyrra, from 4w to something like 1d, and we after that we should be less dependent on the "past"... [08:53:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [08:53:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [08:54:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:10] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:04:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:05:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11151358 (10MoritzMuehlenhoff) Thanks for the quick turnaround, much appreciated! [09:06:43] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11151364 (10MoritzMuehlenhoff) 05Open→03Resolved >>! In T403375#11140589, @RobH wrote: > After updating the idrac, bios, and backplane firmware and resetting & then allowing the sy... [09:06:43] (03PS1) 10Brouberol: airflow: deploy a standalone statsd component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185054 (https://phabricator.wikimedia.org/T403701) [09:07:19] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:07:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:11:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:12:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:17:12] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage [09:19:12] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
an-worker1236.eqiad.wmnet with reason: host reimage [09:21:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:22:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:23:13] !log btullis@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster dse-codfw: Kubernetes upgrade [09:23:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1235.eqiad.wmnet with reason: host reimage [09:27:21] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185054 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [09:27:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1236.eqiad.wmnet with reason: host reimage [09:29:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:30:50] btullis@cumin1003 wipe-cluster (PID 1119531) is awaiting input [09:31:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2014.codfw.wmnet with OS bookworm [09:31:54] (03PS2) 10Brouberol: airflow: deploy a standalone statsd component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185054 (https://phabricator.wikimedia.org/T403701) [09:32:02] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11151394 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2014.codfw.wmnet with OS bookworm completed: - maps2014 (**PASS**) - Downt... 
[09:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:34:37] (03PS1) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185055 (https://phabricator.wikimedia.org/T391009) [09:35:33] (03CR) 10CI reject: [V:04-1] Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185055 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [09:36:16] (03PS1) 10Btullis: Remove a stray reference to PSP in dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185056 (https://phabricator.wikimedia.org/T397301) [09:36:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:37:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6860/co" [puppet] - 10https://gerrit.wikimedia.org/r/1185056 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [09:37:02] (03CR) 10Stevemunene: [C:03+1] Remove a stray reference to PSP in dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185056 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [09:37:42] (03CR) 10Btullis: [V:03+1 C:03+2] Remove a stray reference to PSP in dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185056 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [09:37:44] (03PS6) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 
10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:37:49] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (036 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:39:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:39:59] (03CR) 10Brouberol: [C:03+2] airflow: deploy a standalone statsd component [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185054 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [09:40:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1235.eqiad.wmnet with OS bullseye [09:40:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:42:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:43:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:43:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:43:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] 
DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:44:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1236.eqiad.wmnet with OS bullseye [09:45:43] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [09:46:00] btullis@cumin1003 wipe-cluster (PID 1119531) is awaiting input [09:46:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:25] (03PS1) 10Elukey: [DNM] provision: remove some idrac10 cpu settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1185057 [09:51:37] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:54:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11151421 (10elukey) >>! In T392851#11149517, @Jhancock.wm wrote: > @elukey > > okay so what i did today in terms of firmware updates is: > cp2044 BIOS, iDRAC,... 
[09:58:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:58:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster dse-codfw: Kubernetes upgrade [09:58:57] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:01:19] (03PS1) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [10:01:21] (03PS1) 10Elukey: sre.hosts.provision: fix check for idrac10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1185059 (https://phabricator.wikimedia.org/T392851) [10:03:02] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:08:48] (03PS2) 10Superpes15: Initial configuration for arbcom_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185058 (https://phabricator.wikimedia.org/T391009) [10:08:56] 06SRE, 06Traffic, 10MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976#11151427 (10Vgutierrez) FWIW HAProxy provides 
UUIDv4 out of the box so it should be as easy as `http-request set-header X-Request-Id %[uu... [10:09:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:36] (03CR) 10Hnowlan: [C:03+1] tests: provide a user-agent in test_runbook_exists [alerts] - 10https://gerrit.wikimedia.org/r/1184957 (owner: 10Scott French) [10:14:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.699 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:16:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:11] (03PS1) 10Btullis: Correct the ASN for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185061 (https://phabricator.wikimedia.org/T397301) [10:23:25] (03CR) 10Stevemunene: [C:03+1] "Looks good, Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185061 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [10:23:32] (03CR) 10Elukey: [C:03+2] "Self-merging after testing to avoid leaving provisioning broken. Please let me know later on if you don't like the change!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1185059 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [10:23:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:24:03] (03PS2) 10Elukey: [DNM] provision: remove some idrac10 cpu settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1185057 [10:28:06] (03CR) 10Btullis: [C:03+2] Correct the ASN for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185061 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [10:28:39] (03PS1) 10Dreamy Jazz: SECURITY: Appropriately filter the recentchanges views [puppet] - 10https://gerrit.wikimedia.org/r/1185062 (https://phabricator.wikimedia.org/T402283) [10:29:26] (03CR) 10FNegri: [C:03+2] "Reviewed and approved in task, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1185062 (https://phabricator.wikimedia.org/T402283) (owner: 10Dreamy Jazz) [10:29:33] (03CR) 10FNegri: [V:03+2 C:03+2] SECURITY: Appropriately filter the recentchanges views [puppet] - 10https://gerrit.wikimedia.org/r/1185062 (https://phabricator.wikimedia.org/T402283) (owner: 10Dreamy Jazz) [10:29:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:29:52] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:31:08] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:31:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:32:11] !log 
elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:32:26] (03Abandoned) 10Btullis: dse-k8s: disable cluster_dns to allow core-dns deploy. [puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [10:33:20] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:35:10] (03Merged) 10jenkins-bot: Correct the ASN for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185061 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [10:36:55] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [10:37:16] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [10:41:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.422 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:44:49] (03PS1) 10Dreamy Jazz: Follow-up: Appropriately filter the recentchanges views [puppet] - 10https://gerrit.wikimedia.org/r/1185064 (https://phabricator.wikimedia.org/T402283) [10:45:05] (03PS2) 10Dreamy Jazz: Follow-up: Appropriately filter the recentchanges views [puppet] - 10https://gerrit.wikimedia.org/r/1185064 (https://phabricator.wikimedia.org/T402283) [10:45:06] (03CR) 10CI reject: [V:04-1] Follow-up: Appropriately filter the recentchanges views [puppet] - 
10https://gerrit.wikimedia.org/r/1185064 (https://phabricator.wikimedia.org/T402283) (owner: 10Dreamy Jazz) [10:45:08] (03PS1) 10Btullis: Revert "Correct the ASN for the dse-k8s-codfw cluster" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185065 [10:46:00] (03CR) 10Stevemunene: [C:03+1] Revert "Correct the ASN for the dse-k8s-codfw cluster" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185065 (owner: 10Btullis) [10:47:26] (03CR) 10FNegri: [C:03+2] Follow-up: Appropriately filter the recentchanges views [puppet] - 10https://gerrit.wikimedia.org/r/1185064 (https://phabricator.wikimedia.org/T402283) (owner: 10Dreamy Jazz) [10:48:35] (03PS1) 10Btullis: Update the ASN value for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185066 (https://phabricator.wikimedia.org/T397301) [10:49:33] (03CR) 10Stevemunene: [C:03+1] Update the ASN value for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185066 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [10:51:26] (03PS1) 10Slyngshede: P:cache:haproxy guard datacenter database with if statement [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [10:52:22] (03CR) 10Btullis: [C:03+2] Revert "Correct the ASN for the dse-k8s-codfw cluster" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185065 (owner: 10Btullis) [10:53:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: 10KCVelaga) [10:55:42] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6861/console" [puppet] - 10https://gerrit.wikimedia.org/r/1185067 
(https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [10:55:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402925)', diff saved to https://phabricator.wikimedia.org/P82609 and previous config saved to /var/cache/conftool/dbconfig/20250905-105544-ladsgroup.json [10:55:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:56:01] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:57:22] (03CR) 10Vgutierrez: [C:03+1] P:cache:haproxy guard datacenter database with if statement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [10:59:39] (03Merged) 10jenkins-bot: Revert "Correct the ASN for the dse-k8s-codfw cluster" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185065 (owner: 10Btullis) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0700) [11:00:04] jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T1100). 
[11:00:57] elukey@cumin1003 provision (PID 1134632) is awaiting input [11:05:43] (03CR) 10Stevemunene: [C:03+2] Update the ASN value for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185066 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [11:06:28] (03PS6) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [11:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:10:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P82610 and previous config saved to /var/cache/conftool/dbconfig/20250905-111052-ladsgroup.json [11:12:25] (03CR) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:12:39] (03PS2) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [11:13:01] (03Merged) 10jenkins-bot: Update the ASN value for the dse-k8s-codfw cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185066 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [11:15:54] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:16:02] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. 
[11:16:31] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6862/console" [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [11:18:41] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:20:28] (03PS3) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [11:21:17] elukey@cumin1003 provision (PID 1134632) is awaiting input [11:21:21] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6863/co" [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [11:22:45] (03PS4) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [11:23:37] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6864/console" [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [11:24:06] (03CR) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [11:26:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P82611 and previous config saved to /var/cache/conftool/dbconfig/20250905-112559-ladsgroup.json [11:30:13] (03PS5) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 
10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [11:31:50] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:33:41] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:41:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T402925)', diff saved to https://phabricator.wikimedia.org/P82612 and previous config saved to /var/cache/conftool/dbconfig/20250905-114107-ladsgroup.json [11:41:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:41:22] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2219.codfw.wmnet with reason: Maintenance [11:41:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82613 and previous config saved to /var/cache/conftool/dbconfig/20250905-114129-ladsgroup.json [11:51:37] (03PS1) 10Brouberol: airflow: change the statsd service type into clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185072 (https://phabricator.wikimedia.org/T403701) [11:51:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [11:53:23] (03PS1) 10Btullis: Enabled prometheus support for dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185074 (https://phabricator.wikimedia.org/T397301) [11:58:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1233.eqiad.wmnet with OS bullseye [11:58:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [11:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:59:54] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [12:02:11] (03PS1) 10DCausse: cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185075 (https://phabricator.wikimedia.org/T372912) [12:02:26] (03PS2) 10Btullis: Enable prometheus support for dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185074 (https://phabricator.wikimedia.org/T397301) [12:02:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:03:51] (03CR) 10Brouberol: [C:03+1] Enable prometheus support for dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185074 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [12:04:02] (03CR) 10Btullis: [C:03+1] airflow: change the statsd service type into clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185072 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [12:04:18] (03CR) 10Brouberol: [C:03+2] 
airflow: change the statsd service type into clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185072 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [12:04:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:06:46] (03CR) 10Stevemunene: [C:03+1] Enable prometheus support for dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1185074 (https://phabricator.wikimedia.org/T397301) (owner: 10Btullis) [12:10:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3006.esams.wmnet [12:10:13] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:14:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3006.esams.wmnet - jmm@cumin2002" [12:14:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.145 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:15:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3006.esams.wmnet - jmm@cumin2002" [12:15:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:15:10] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3006.esams.wmnet on all recursors [12:15:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3006.esams.wmnet on all recursors [12:15:45] !log 
jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3006.esams.wmnet - jmm@cumin2002" [12:15:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3006.esams.wmnet - jmm@cumin2002" [12:17:00] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [12:17:37] jouncebot: nowandnext [12:17:37] For the next 18 hour(s) and 42 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0700) [12:17:37] In 18 hour(s) and 42 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250906T0700) [12:17:43] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185075 (https://phabricator.wikimedia.org/T372912) (owner: 10DCausse) [12:18:51] jmm@cumin2002 makevm (PID 2780597) is awaiting input [12:19:44] 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11151674 (10Gehel) 05Open→03Resolved [12:20:31] (03Merged) 10jenkins-bot: cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185075 (https://phabricator.wikimedia.org/T372912) (owner: 10DCausse) [12:21:39] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:23:13] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:25:36] (03CR) 10Vgutierrez: P:cache:haproxy prevent download of datacenter.mmdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [12:27:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 13.1 point update - https://phabricator.wikimedia.org/T403815 (10MoritzMuehlenhoff) 03NEW [12:28:04] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 13.1 point update - https://phabricator.wikimedia.org/T403815#11151718 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:28:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3006.esams.wmnet with OS bookworm 
[12:28:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11151725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host durum3006.esams.wmnet with OS bookworm [12:29:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:31:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [12:33:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [12:33:59] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/1182848 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [12:36:01] (03CR) 10Btullis: [C:03+1] "LGTM. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [12:36:23] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: Add opensearch-ipoid namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [12:36:28] btullis@cumin1003 reimage (PID 1140447) is awaiting input [12:36:45] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [12:37:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:38:57] FIRING: SLOMetricAbsent: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:45:35] (03PS4) 10Cory Massaro: Increase max recursion depth in the orchestrator's composition language. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184766 (https://phabricator.wikimedia.org/T403954) [12:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:47:20] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:48:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3006.esams.wmnet with reason: host reimage [12:48:57] (03PS1) 10Btullis: Bump the size of the java heap for the HDFS namenodes [puppet] - 10https://gerrit.wikimedia.org/r/1185082 (https://phabricator.wikimedia.org/T342587) [12:50:08] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6866/co" [puppet] - 10https://gerrit.wikimedia.org/r/1185082 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis) [12:51:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11151819 (10elukey) I tested the cookbook with newer nodes without the root account manually set up, and this is the result (consistent between hosts): ` Updat... 
[12:52:40] (03PS6) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) [12:52:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3006.esams.wmnet with reason: host reimage [12:53:30] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:53:38] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:53:46] (03CR) 10Slyngshede: P:cache:haproxy prevent download of datacenter.mmdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [12:54:54] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:55:04] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:56:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.320 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:12] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:58:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:59:19] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:59:28] (03CR) 10Vgutierrez: [C:03+1] P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 
(https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [13:02:36] (03CR) 10Slyngshede: [C:03+2] P:cache:haproxy prevent download of datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1185067 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [13:03:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:04:11] (03PS7) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) [13:04:21] (03CR) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [13:04:39] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:04:44] !log dcausse@deploy1003 helmfile 
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:06:41] (03CR) 10Effie Mouzeli: mediawiki-dumps-legacy: Use in-pod mcrouter container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [13:08:45] FIRING: [2x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:09:00] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:09:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:10:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3006.esams.wmnet with OS bookworm [13:10:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3006.esams.wmnet [13:10:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11151868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host durum3006.esams.wmnet with OS bookworm completed: - durum300... 
[13:11:05] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:11:45] RESOLVED: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [13:11:50] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:12:23] (03PS7) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:12:25] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (036 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:12:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:13:45] RESOLVED: [2x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:14:29] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 53066 [13:14:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering 
(exit_code=0) with action 'configure' for AS: 53066 [13:15:14] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [13:15:41] (03PS1) 10Slyngshede: P:cache::haproxy guard datacenter database with if [puppet] - 10https://gerrit.wikimedia.org/r/1185090 (https://phabricator.wikimedia.org/T403616) [13:16:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6867/console" [puppet] - 10https://gerrit.wikimedia.org/r/1185090 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [13:17:12] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1233.eqiad.wmnet with OS bullseye [13:17:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1233.eqiad.wmnet with OS bullseye [13:18:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:18:50] FIRING: [13x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:20:52] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [13:21:30] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [13:21:38] !log jiji@deploy1003 helmfile [codfw] START 
helmfile.d/services/mw-experimental: apply [13:21:40] (03Abandoned) 10Kosta Harlan: Revert "hCaptcha: Update secure enclave API endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184969 (owner: 10Kosta Harlan) [13:21:58] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy guard datacenter database with if [puppet] - 10https://gerrit.wikimedia.org/r/1185090 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [13:22:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:22:52] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:cache::haproxy guard datacenter database with if [puppet] - 10https://gerrit.wikimedia.org/r/1185090 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [13:23:19] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [13:23:39] FIRING: [12x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1075-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:25:18] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [13:26:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:26:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P82614 and previous config saved to /var/cache/conftool/dbconfig/20250905-132632-fceratto.json [13:26:36] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:26:55] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Use in-pod mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184878 (https://phabricator.wikimedia.org/T401425) (owner: 10Effie Mouzeli) [13:27:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:27:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [13:28:39] FIRING: [15x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:28:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T401906)', diff saved to https://phabricator.wikimedia.org/P82615 and previous config saved to /var/cache/conftool/dbconfig/20250905-132842-fceratto.json [13:29:16] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:29:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:30:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:30:31] !log btullis@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:33:39] FIRING: [16x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3006.esams.wmnet [13:37:06] (03PS1) 10Muehlenhoff: Make durum3006 a durum node [puppet] - 10https://gerrit.wikimedia.org/r/1185094 (https://phabricator.wikimedia.org/T402259) [13:37:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:38:38] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1185094 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:38:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1096-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:39:24] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11151984 (10ABran-WMF) mailman-web has been restarted, it seems to be a bit faster now [13:39:53] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks for the fix-up." [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [13:40:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11151990 (10elukey) Something that I noticed is that the Etag returned for the root account on ms-be1081 and cp2040 (idrac 9 nodes) is something like `'ETag': '... 
[13:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:43:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3006.esams.wmnet - jmm@cumin2002" [13:43:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:43:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82616 and previous config saved to /var/cache/conftool/dbconfig/20250905-134350-fceratto.json [13:44:03] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:44:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3006.esams.wmnet - jmm@cumin2002" [13:44:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3006.esams.wmnet on all recursors [13:44:14] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:44:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3006.esams.wmnet on all recursors [13:44:31] (03PS1) 10Dreamy Jazz: hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] 
(wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) [13:44:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3006.esams.wmnet - jmm@cumin2002" [13:44:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3006.esams.wmnet - jmm@cumin2002" [13:45:13] (03CR) 10Dreamy Jazz: [C:03+1] "Secure enclave mode is broken without this, so I think we should backport (even thought it's Friday)" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [13:45:23] (03CR) 10Ssingh: [C:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [13:46:26] (03CR) 10Ssingh: [C:03+1] varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [13:46:49] jouncebot: nowandnext [13:46:49] For the next 17 hour(s) and 13 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250905T0700) [13:46:49] In 17 hour(s) and 13 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250906T0700) [13:47:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [13:47:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3006.esams.wmnet with OS bookworm [13:47:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11152040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir3006.esams.wmnet with OS bookworm [13:48:39] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:15] (03CR) 10Ayounsi: [C:03+1] Make durum3006 a durum node [puppet] - 10https://gerrit.wikimedia.org/r/1185094 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:49:37] (03PS1) 10Wikipedia₹123: Any Updates [software] - 10https://gerrit.wikimedia.org/r/1185097 [13:51:06] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:51:53] (03PS1) 10Superpes15: Fix alphabetical order [dns] - 10https://gerrit.wikimedia.org/r/1185098 [13:51:54] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:52:14] (03PS1) 10Brouberol: airflow: inject a domain label on the exported metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185099 (https://phabricator.wikimedia.org/T403701) [13:53:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:53:50] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:55:32] (03CR) 10CI reject: [V:04-1] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [13:56:10] (03CR) 10Ssingh: "Thanks for the patch, one comment below." 
[dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [13:56:27] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:56:54] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:57:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [13:57:25] (03CR) 10Kosta Harlan: "recheck" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [13:58:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:58:22] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11152051 (10MoritzMuehlenhoff) [13:58:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:58:50] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:58:58] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P82617 and previous config saved to /var/cache/conftool/dbconfig/20250905-135857-fceratto.json [13:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:03:36] (03CR) 10Krinkle: varnish: factor out unified_mobile_domain_regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:03:37] (03CR) 10Bking: [C:03+1] team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) (owner: 10Klausman) [14:03:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11152072 (10MoritzMuehlenhoff) The new maps servers initially used RAID10, but with the 4x940G drives in the new servers we would have ended up with a little over 1.4T on /srv, which... 
[14:03:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:04:02] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1184886-1184126-1184130'":T403510 [14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:05] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [14:04:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [14:04:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:05:11] (03PS2) 10Superpes15: wikimedia.org: Fix alphabetical order [dns] - 10https://gerrit.wikimedia.org/r/1185098 [14:05:52] andrew@cumin2002 reimage (PID 2838778) is awaiting input [14:06:34] 06SRE, 06Traffic: NDA/Volunteer Agreement for CMU research collabration with SRE/Traffic - https://phabricator.wikimedia.org/T403825 (10ssingh) 03NEW [14:06:37] (03CR) 10Ssingh: [C:03+2] varnish: factor out unified_mobile_domain_regex [puppet] - 10https://gerrit.wikimedia.org/r/1184886 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:06:42] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudcephosd1052.eqiad.wmnet with OS bullseye [14:07:32] (03CR) 10Superpes15: wikimedia.org: Fix alphabetical order (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [14:07:38] (03CR) 10Wikipedia₹123: "#Resolved" [dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [14:07:58] (03CR) 10Scott French: [C:03+1] mw-videoscaler: Upgrade to envoy 1.26.8 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184918 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [14:08:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3006.esams.wmnet with reason: host reimage [14:08:37] !log enabling puppet on cp3068: testing CR 1184886 T401595 [14:08:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1092-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:40] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [14:08:56] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:09:20] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:09:36] (03CR) 10Scott French: [C:03+1] mediawiki: Update to configuration_1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184893 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [14:09:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:45] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [14:09:57] (03CR) 10Kosta Harlan: [C:03+2] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:10:02] btullis@cumin1003 reimage (PID 1147539) is awaiting input [14:10:38] (03CR) 10Ssingh: [C:03+2] varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:10:44] (03Abandoned) 10Majavah: Any Updates [software] - 10https://gerrit.wikimedia.org/r/1185097 (owner: 10Wikipedia₹123) [14:11:03] (03PS10) 10Ssingh: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:11:13] (03CR) 10Ssingh: "Rebased on production, no code change." [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:11:16] (03CR) 10CI reject: [V:04-1] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:11:37] (03CR) 10Scott French: "Thanks, Hugh!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1184957 (owner: 10Scott French) [14:11:42] (03CR) 10Scott French: [C:03+2] tests: provide a user-agent in test_runbook_exists [alerts] - 10https://gerrit.wikimedia.org/r/1184957 (owner: 10Scott French) [14:11:55] (03CR) 10Ssingh: [C:03+2] wikimedia.org: Fix alphabetical order [dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [14:12:02] !log sukhe@dns1004 START - running authdns-update [14:12:15] (03CR) 10Ssingh: [C:03+2] "Thanks for the patch, merged!" [dns] - 10https://gerrit.wikimedia.org/r/1185098 (owner: 10Superpes15) [14:12:29] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:12:43] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:12:47] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:13:05] !log sukhe@dns1004 END - running authdns-update [14:13:06] (03PS1) 10Muehlenhoff: imposm: Drop quiet from start flags [puppet] - 10https://gerrit.wikimedia.org/r/1185104 [14:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3006.esams.wmnet with reason: host reimage [14:13:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1092-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:14:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T401906)', diff saved to https://phabricator.wikimedia.org/P82619 and previous config saved to 
/var/cache/conftool/dbconfig/20250905-141404-fceratto.json [14:14:08] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:14:15] (03Merged) 10jenkins-bot: tests: provide a user-agent in test_runbook_exists [alerts] - 10https://gerrit.wikimedia.org/r/1184957 (owner: 10Scott French) [14:14:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:14:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T401906)', diff saved to https://phabricator.wikimedia.org/P82620 and previous config saved to /var/cache/conftool/dbconfig/20250905-141427-fceratto.json [14:14:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:14:38] (03CR) 10Kosta Harlan: [C:03+2] "recheck" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:14:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [14:14:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[14:14:45] CirrusSearch consumer-search@codfw is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [14:15:21] (03CR) 10Ssingh: [C:03+2] varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:15:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T401906)', diff saved to https://phabricator.wikimedia.org/P82621 and previous config saved to /var/cache/conftool/dbconfig/20250905-141537-fceratto.json [14:16:14] (03CR) 10Elukey: [C:03+1] imposm: Drop quiet from start flags [puppet] - 10https://gerrit.wikimedia.org/r/1185104 (owner: 10Muehlenhoff) [14:18:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:19:32] (03CR) 10Krinkle: [C:03+1] beta: Remove replica instance from wmgMainStashServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184937 (https://phabricator.wikimedia.org/T401227) (owner: 10BryanDavis) [14:19:39] (03CR) 10Ssingh: [C:03+2] varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [14:19:56] (03PS6) 10Ssingh: varnish: Enable unified routing on 
mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [14:20:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [14:20:48] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [14:20:53] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:21:38] (03CR) 10Ssingh: [C:03+2] varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [14:23:39] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:24:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[14:24:45] CirrusSearch consumer-search@codfw is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [14:25:05] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11152225 (10Gehel) [14:26:04] 07sre-alert-triage, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11152251 (10Gehel) [14:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3006.esams.wmnet with OS bookworm [14:26:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3006.esams.wmnet [14:28:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:29:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82622 and previous config saved to /var/cache/conftool/dbconfig/20250905-142921-ladsgroup.json [14:29:27] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:30:05] (03PS1) 10Superpes15: [lbwiki] Change to 
'uca-lb-u-kn' category collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185105 (https://phabricator.wikimedia.org/T402083) [14:30:06] !log sudo cumin -b31 "A:cp-text" "run-puppet-agent --enable 'merging CR 1184886-1184126-1184130'" [14:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11152349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir3006.esams.wmnet with OS bookworm completed: - ncredi... [14:30:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82623 and previous config saved to /var/cache/conftool/dbconfig/20250905-143045-fceratto.json [14:30:46] (03CR) 10Kosta Harlan: hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:30:50] (03CR) 10Kosta Harlan: [C:03+2] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:30:53] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1184117|Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki (T401595)]], [[gerrit:1184131|Disable wmgUseMdotRouting on mediawiki.org (T403510)]] [14:30:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:30:57] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 
[14:30:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [14:30:58] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [14:31:04] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [14:31:08] (03Merged) 10jenkins-bot: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [14:33:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:35:20] !log sudo cumin -b31 "A:cp-upload" "run-puppet-agent --enable 'merging CR 1184886-1184126-1184130'" [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:12] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1184117|Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki (T401595)]], [[gerrit:1184131|Disable wmgUseMdotRouting on mediawiki.org (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:36:17] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [14:36:17] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [14:37:36] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [14:39:16] !log krinkle@deploy1003 krinkle: Continuing with sync [14:40:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [14:40:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:41:35] (03CR) 10Ssingh: "Adding @vgutierrez@wikimedia.org and @ffurnari@wikimedia.org based on the linked follow up, for their confirmation." [puppet] - 10https://gerrit.wikimedia.org/r/1183274 (https://phabricator.wikimedia.org/T392073) (owner: 10Krinkle) [14:41:38] (03CR) 10CI reject: [V:04-1] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [14:41:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[14:41:50] CirrusSearch consumer-search@codfw is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [14:42:56] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [14:44:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P82624 and previous config saved to /var/cache/conftool/dbconfig/20250905-144429-ladsgroup.json [14:44:34] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184117|Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki (T401595)]], [[gerrit:1184131|Disable wmgUseMdotRouting on mediawiki.org (T403510)]] (duration: 13m 40s) [14:44:38] T401595: [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595 [14:44:38] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [14:45:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [14:45:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P82625 and previous config saved to 
/var/cache/conftool/dbconfig/20250905-144552-fceratto.json [14:46:17] (03PS1) 10Muehlenhoff: Add ncredir3006 [puppet] - 10https://gerrit.wikimedia.org/r/1185107 (https://phabricator.wikimedia.org/T402259) [14:47:21] (03PS1) 10Reedy: Drop PageViewInfo's integration with the Graph extension on action=info, the extension is dead [extensions/PageViewInfo] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185108 (https://phabricator.wikimedia.org/T403753) [14:47:29] (03CR) 10Reedy: [C:03+2] Drop PageViewInfo's integration with the Graph extension on action=info, the extension is dead [extensions/PageViewInfo] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185108 (https://phabricator.wikimedia.org/T403753) (owner: 10Reedy) [14:48:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1090-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:48:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:48:50] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:49:49] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:51:24] (03CR) 10Vgutierrez: [C:03+1] Add ncredir3006 [puppet] - 10https://gerrit.wikimedia.org/r/1185107 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [14:52:54] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend 
(Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11152416 (10Krinkle) [14:53:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1090-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:53:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:53:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11152420 (10Jhancock.wm) [14:54:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:55:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [14:55:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [14:56:27] !log 
dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:56:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:56:44] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:57:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:58:39] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1090-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:59:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P82626 and previous config saved to /var/cache/conftool/dbconfig/20250905-145937-ladsgroup.json [15:00:11] (03Merged) 10jenkins-bot: Drop PageViewInfo's integration with the Graph extension on action=info, the extension is dead [extensions/PageViewInfo] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185108 (https://phabricator.wikimedia.org/T403753) (owner: 10Reedy) [15:00:34] (03CR) 10Reedy: [C:03+2] hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [15:01:22] (03PS1) 10Andrew Bogott: cloudcephosd: update nic names for cloudcephosd1052 [puppet] - 
10https://gerrit.wikimedia.org/r/1185110 [15:01:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 07Essential-Work: decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11152467 (10Gehel) [15:02:03] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd: update nic names for cloudcephosd1052 [puppet] - 10https://gerrit.wikimedia.org/r/1185110 (owner: 10Andrew Bogott) [15:02:07] 07sre-alert-triage, 10Wikidata, 06Wikidata-Omega, 10Wikidata-Query-Service, 07Essential-Work: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11152468 (10Gehel) [15:02:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 07Essential-Work: decommission an-druid100[1-2] - https://phabricator.wikimedia.org/T402814#11152470 (10Gehel) [15:03:39] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:04:30] (03CR) 10Brouberol: [C:03+1] Bump the size of the java heap for the HDFS namenodes [puppet] - 10https://gerrit.wikimedia.org/r/1185082 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis) [15:04:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:04:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of 
`(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [15:05:50] 07sre-alert-triage, 07Essential-Work: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11152491 (10Gehel) [15:06:21] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1052.eqiad.wmnet with OS bullseye [15:06:40] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11152504 (10Gehel) [15:06:50] (03PS1) 10Papaul: Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) [15:07:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:07:50] 07sre-alert-triage, 07Essential-Work: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11152510 (10Gehel) [15:08:17] (03CR) 10CI reject: [V:04-1] Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - 
https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:09:53] 10ops-codfw, 06SRE, 06DC-Ops, 07Essential-Work: cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11152526 (10Gehel) [15:11:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 07Essential-Work: Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11152540 (10Gehel) [15:11:27] 06SRE, 10envoy, 06serviceops, 07Essential-Work: Apply Envoy updates to wcqs and wdqs hosts - https://phabricator.wikimedia.org/T403738#11152544 (10Gehel) [15:12:59] (03CR) 10Ayounsi: [C:03+1] Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:13:05] (03PS2) 10Papaul: Adding BGP to mr1-eqsin, cr2/3-eqsin to replace OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/1185112 (https://phabricator.wikimedia.org/T294845) [15:13:20] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173 [15:13:21] (03Merged) 10jenkins-bot: hCaptcha: Fix secure enclave implementation [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [15:13:22] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Add opensearch-ipoid namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1184551 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:13:25] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [15:13:39] (03CR) 10Bking: [C:03+2] "approval implied by discussions here and on 1184554" [puppet] - 10https://gerrit.wikimedia.org/r/1184551 
(https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:14:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185096 (https://phabricator.wikimedia.org/T378188) (owner: 10Dreamy Jazz) [15:14:30] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:14:34] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:14:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [15:14:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T402925)', diff saved to https://phabricator.wikimedia.org/P82627 and previous config saved to /var/cache/conftool/dbconfig/20250905-151444-ladsgroup.json [15:14:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:14:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:14:59] (03CR) 10Btullis: [C:03+1] airflow: inject a domain label on the exported metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185099 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [15:15:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2236.codfw.wmnet with reason: Maintenance [15:15:04] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1184968|hCaptcha: Update secure enclave API endpoint]], [[gerrit:1185096|hCaptcha: Fix secure enclave implementation (T378188)]] [15:15:08] T378188: 
Implement secure enclave mode for hCaptcha - https://phabricator.wikimedia.org/T378188 [15:15:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T402925)', diff saved to https://phabricator.wikimedia.org/P82628 and previous config saved to /var/cache/conftool/dbconfig/20250905-151508-ladsgroup.json [15:15:26] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:15:39] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:15:46] (03CR) 10Brouberol: [C:03+2] airflow: inject a domain label on the exported metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185099 (https://phabricator.wikimedia.org/T403701) (owner: 10Brouberol) [15:16:57] (03Abandoned) 10Btullis: MachineVision extension is being sunsetted, so stop doing dumps [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [15:17:15] (03PS1) 10Papaul: Add eqsin privare IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (https://phabricator.wikimedia.org/T294845) [15:18:37] (03CR) 10CI reject: [V:04-1] Add eqsin privare IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:18:39] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:19:00] (03CR) 10Ayounsi: [C:03+1] Add eqsin privare IPV4 to prefix-list pops4 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (https://phabricator.wikimedia.org/T294845) 
(owner: 10Papaul) [15:19:37] (03PS1) 10Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group0) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) [15:19:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [15:19:48] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [15:19:53] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:20:19] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Add opensearch-ipoid namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184554 (https://phabricator.wikimedia.org/T403534) (owner: 10Bking) [15:20:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:20:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:21:18] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:21:39] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:22:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:23:15] (03CR) 10Ssingh: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group0) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [15:23:39] RESOLVED: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1070-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:23:57] RESOLVED: SLOMetricAbsent: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:24:18] (03CR) 10Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group0) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [15:25:15] (03PS2) 10Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group1) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) [15:26:23] (03CR) 10Ssingh: [C:03+1] "Do you want to roll this out today as well?" [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [15:26:42] (03CR) 10Krinkle: "No, this is for next Monday or Tuesday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [15:26:49] (03PS2) 10Papaul: Add eqsin privare IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 [15:27:17] (03CR) 10Ssingh: [C:03+1] "OK please ping us when you want to deploy it." [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [15:27:17] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510) [15:27:49] RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:28:25] (03CR) 10CI reject: [V:04-1] Add eqsin privare IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (owner: 10Papaul) [15:29:29] (03PS3) 10Papaul: Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (https://phabricator.wikimedia.org/T294845) [15:30:55] (03CR) 10CI reject: [V:04-1] Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:31:25] FIRING: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:31:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:31:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with 
reason: Maintenance [15:31:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T401906)', diff saved to https://phabricator.wikimedia.org/P82629 and previous config saved to /var/cache/conftool/dbconfig/20250905-153157-fceratto.json [15:32:01] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T401906)', diff saved to https://phabricator.wikimedia.org/P82630 and previous config saved to /var/cache/conftool/dbconfig/20250905-153407-fceratto.json [15:42:19] James_F / Reedy: there should be no user visible impact for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageViewInfo/+/1185108, right? (It's about to go live everywhere on wmf.17.) [15:42:23] !log kharlan@deploy1003 dreamyjazz, kharlan: Backport for [[gerrit:1184968|hCaptcha: Update secure enclave API endpoint]], [[gerrit:1185096|hCaptcha: Fix secure enclave implementation (T378188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:42:27] T378188: Implement secure enclave mode for hCaptcha - https://phabricator.wikimedia.org/T378188 [15:42:32] It's been off for a while [15:42:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[15:42:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:43:03] (03PS4) 10Papaul: Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 [15:45:15] (03CR) 10CI reject: [V:04-1] Add eqsin private IPV4 to prefix-list pops4 [homer/public] - 10https://gerrit.wikimedia.org/r/1185115 (owner: 10Papaul) [15:47:46] FIRING: Traffic on tunnel link: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [15:47:51] (03PS1) 10Dzahn: zuul::executor: add parameter for port and set it to 7100 [puppet] - 10https://gerrit.wikimedia.org/r/1184924 (https://phabricator.wikimedia.org/T395938) [15:47:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:48:10] kostajh: Yes, it's just removing code that's already been dead for years. [15:48:36] (03CR) 10Dzahn: [C:03+2] "we are going to use this in a template - not sure about the details yet but can be sure we will need this.. 
going ahead because I prefer s" [puppet] - 10https://gerrit.wikimedia.org/r/1184924 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [15:49:02] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:49:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P82631 and previous config saved to /var/cache/conftool/dbconfig/20250905-154914-fceratto.json [15:51:26] thanks. [15:52:41] (03CR) 10Dzahn: "the "check experimental" build failed because there are no Host: headers and then it tries to compile it on literally every host.. which t" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [15:55:01] (03CR) 10Dzahn: "building via https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/build?delay=0sec and using "C:gitlab::ru" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [15:55:18] 06SRE, 10MediaWiki-Action-API, 06MW-Interfaces-Team, 07Wikimedia-production-error: Frequent HTTP 503 errors from MediaWiki API every 1 or 2 minutes - https://phabricator.wikimedia.org/T390438#11152774 (10akosiaris) 05Open→03Resolved a:03akosiaris >>! In T390438#10804169, @Magog_the_Ogre wrote: >... [15:55:49] how long do I have to say "yes/no" to "Continue with sync?" ? [15:57:02] (03CR) 10Dzahn: "buildkitd.gitlab-runner is removed from the "NO_PROXY_ env variable in /etc/default/buildkitd" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [15:57:30] kostajh: I think it times out eventually. 30 mins? [15:57:38] ok [15:57:44] Scap has no interaction timeouts. [15:57:52] Doesn't the lock time out? [15:57:59] No. 
[15:58:02] A wait for a lock will timeout [15:58:03] (03CR) 10Dzahn: "" "BUILDKIT_HOST=tcp://buildkitd.gitlab-runner:1234"," is removed from /home/gitlab-runner/.gitlab-runner/managed.toml" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [15:58:09] Oh, interesting. [15:59:19] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:00:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [16:00:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [16:00:58] good to know, thanks [16:01:19] I should be moving this deploy forward soon, anyway. [16:01:48] !log kharlan@deploy1003 dreamyjazz, kharlan: Continuing with sync [16:04:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P82632 and previous config saved to /var/cache/conftool/dbconfig/20250905-160422-fceratto.json [16:06:24] FIRING: [2x] ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:07:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:07:46] RESOLVED: Traffic on tunnel link: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [16:11:45] FIRING: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:13:49] (03PS1) 10Umherirrender: build: Update .phpcs.xml for array-type property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185156 (https://phabricator.wikimedia.org/T403781) [16:16:07] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184968|hCaptcha: Update secure enclave API endpoint]], [[gerrit:1185096|hCaptcha: Fix secure enclave implementation (T378188)]] (duration: 61m 02s) [16:16:11] T378188: Implement secure enclave mode for hCaptcha - https://phabricator.wikimedia.org/T378188 [16:17:00] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11152860 (10Krinkle) [16:17:47] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11152865 (10Krinkle) [16:19:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T401906)', diff saved to https://phabricator.wikimedia.org/P82633 and previous config saved to /var/cache/conftool/dbconfig/20250905-161929-fceratto.json [16:19:34] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:19:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:19:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [16:19:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P82634 and previous config saved to /var/cache/conftool/dbconfig/20250905-161952-fceratto.json [16:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:45] RESOLVED: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:22:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82635 and previous config saved to /var/cache/conftool/dbconfig/20250905-162202-fceratto.json [16:24:26] (03PS1) 10Catrope: Fix display of Codex message icons II [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185159 (https://phabricator.wikimedia.org/T401457) [16:25:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [skins/Vector] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185159 (https://phabricator.wikimedia.org/T401457) (owner: 10Catrope) [16:28:30] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:29:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.815 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:37:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P82636 and previous 
config saved to /var/cache/conftool/dbconfig/20250905-163709-fceratto.json [16:38:30] FIRING: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:38:31] !log dzahn@cumin2002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: no reason specified, ] [16:38:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: no reason specified, ] [16:39:22] !log depooling ulsfo (fiber cut) [16:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:52:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P82637 and previous config saved to /var/cache/conftool/dbconfig/20250905-165217-fceratto.json [16:52:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:53:30] FIRING: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:58:30] RESOLVED: [2x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [16:58:48] (03PS11) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:03:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:05:29] (03CR) 10Herron: [C:03+1] nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [17:05:43] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:06:15] (03Abandoned) 10Herron: pyrra::isito: add revision label [puppet] - 10https://gerrit.wikimedia.org/r/1175564 (owner: 10Herron) [17:07:05] (03PS12) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:07:18] (03Abandoned) 10Herron: thanos: clean citoid SLO recording rule history [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) 
[17:07:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82638 and previous config saved to /var/cache/conftool/dbconfig/20250905-170725-fceratto.json [17:07:29] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:07:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance [17:07:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82639 and previous config saved to /var/cache/conftool/dbconfig/20250905-170747-fceratto.json [17:09:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82640 and previous config saved to /var/cache/conftool/dbconfig/20250905-170956-fceratto.json [17:12:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:14:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:15:05] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:16:31] (03PS13) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:16:46] !log jhathaway@cumin2002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173 [17:16:49] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [17:19:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [17:22:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:25:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P82641 and previous config saved to /var/cache/conftool/dbconfig/20250905-172504-fceratto.json [17:32:40] (03PS14) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:33:58] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:37:49] (03PS1) 10Clare Ming: xLab: Deploy v1.0.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185172 (https://phabricator.wikimedia.org/T371225) [17:38:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:40:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P82642 and previous config saved to /var/cache/conftool/dbconfig/20250905-174011-fceratto.json [17:41:16] (03PS2) 10Clare Ming: xLab: Deploy v1.0.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185172 (https://phabricator.wikimedia.org/T371225) [17:41:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.893 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:42:51] (03PS1) 10Dzahn: zuul: define main and executor host names in common hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1185174 (https://phabricator.wikimedia.org/T403847) [17:43:27] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1185175 [17:43:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1185175 (owner: 10CDanis) [17:44:25] (03CR) 10Santiago Faci: [C:03+2] 
xLab: Deploy v1.0.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185172 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming) [17:45:58] (03Merged) 10jenkins-bot: xLab: Deploy v1.0.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185172 (https://phabricator.wikimedia.org/T371225) (owner: 10Clare Ming) [17:47:57] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [17:48:01] (03PS2) 10CDanis: hcaptcha: eschew newassets, use js instead [puppet] - 10https://gerrit.wikimedia.org/r/1185175 (https://phabricator.wikimedia.org/T378188) [17:48:28] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [17:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:35] (03CR) 10Dzahn: [C:03+2] zuul: define main and executor host names in common hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1185174 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [17:49:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[17:49:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:50:51] (03CR) 10RLazarus: [C:03+1] hcaptcha: eschew newassets, use js instead [puppet] - 10https://gerrit.wikimedia.org/r/1185175 (https://phabricator.wikimedia.org/T378188) (owner: 10CDanis) [17:51:07] (03CR) 10CDanis: [C:03+2] hcaptcha: eschew newassets, use js instead [puppet] - 10https://gerrit.wikimedia.org/r/1185175 (https://phabricator.wikimedia.org/T378188) (owner: 10CDanis) [17:51:07] (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: eschew newassets, use js instead [puppet] - 10https://gerrit.wikimedia.org/r/1185175 (https://phabricator.wikimedia.org/T378188) (owner: 10CDanis) [17:52:58] (03PS15) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [17:54:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:55:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T401906)', diff saved to https://phabricator.wikimedia.org/P82643 and previous config saved to /var/cache/conftool/dbconfig/20250905-175519-fceratto.json [17:55:23] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:55:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with 
reason: Maintenance [17:55:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T401906)', diff saved to https://phabricator.wikimedia.org/P82644 and previous config saved to /var/cache/conftool/dbconfig/20250905-175541-fceratto.json [17:56:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11153184 (10Jclark-ctr) a:05Jclark-ctr→03bking [17:56:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T401906)', diff saved to https://phabricator.wikimedia.org/P82645 and previous config saved to /var/cache/conftool/dbconfig/20250905-175651-fceratto.json [17:57:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T402925)', diff saved to https://phabricator.wikimedia.org/P82646 and previous config saved to /var/cache/conftool/dbconfig/20250905-175700-ladsgroup.json [17:57:04] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:59:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:25] FIRING: [2x] ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:04:43] (03CR) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:05:41] (03PS1) 10Dzahn: zuul::executor: firewall rule to allow main nodes to zuul-web port [puppet] - 10https://gerrit.wikimedia.org/r/1185180 (https://phabricator.wikimedia.org/T403847) [18:05:56] (03CR) 10CI reject: [V:04-1] zuul::executor: firewall rule to allow main nodes to zuul-web port [puppet] - 10https://gerrit.wikimedia.org/r/1185180 
(https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:06:33] (03PS2) 10Dzahn: zuul::executor: firewall rule to allow main nodes to zuul-web port [puppet] - 10https://gerrit.wikimedia.org/r/1185180 (https://phabricator.wikimedia.org/T403847) [18:09:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:06] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1185180/6871/zuul1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1185180 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:10:58] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul::executor: firewall rule to allow main nodes to zuul-web port [puppet] - 10https://gerrit.wikimedia.org/r/1185180 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:11:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P82647 and previous config saved to /var/cache/conftool/dbconfig/20250905-181158-fceratto.json [18:12:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P82648 and previous config saved to /var/cache/conftool/dbconfig/20250905-181207-ladsgroup.json [18:12:38] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11153283 (10Krinkle) [18:16:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852 (10MoritzMuehlenhoff) 03NEW [18:16:46] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 
12.12 point update - https://phabricator.wikimedia.org/T403852#11153316 (10MoritzMuehlenhoff) p:05Triage→03Medium [18:16:52] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11153301 (10Krinkle) [18:16:58] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.1 point update - https://phabricator.wikimedia.org/T403815#11153317 (10MoritzMuehlenhoff) [18:24:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:26:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.673 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P82649 and previous config saved to /var/cache/conftool/dbconfig/20250905-182705-fceratto.json [18:27:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P82650 and previous config saved to /var/cache/conftool/dbconfig/20250905-182715-ladsgroup.json [18:31:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#11153347 (10Jclark-ctr) 05Resolved→03Open @VRiley-WMF elastic1067 is still listed in Netbox in rack please Verify and update [18:33:57] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:39:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:41:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:42:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T401906)', diff saved to https://phabricator.wikimedia.org/P82651 and previous config saved to /var/cache/conftool/dbconfig/20250905-184213-fceratto.json [18:42:18] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [18:42:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T402925)', diff saved to https://phabricator.wikimedia.org/P82652 and previous config saved to /var/cache/conftool/dbconfig/20250905-184222-ladsgroup.json [18:42:26] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [18:42:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [18:42:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T401906)', diff saved to https://phabricator.wikimedia.org/P82653 and previous config saved to /var/cache/conftool/dbconfig/20250905-184236-fceratto.json [18:42:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2237.codfw.wmnet with reason: Maintenance 
[18:42:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T402925)', diff saved to https://phabricator.wikimedia.org/P82654 and previous config saved to /var/cache/conftool/dbconfig/20250905-184245-ladsgroup.json [18:44:30] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#11153396 (10Jclark-ctr) @VRiley-WMF These are still listed as in rack in netbox and decom status please update and fix [18:44:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.773 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:44:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [18:44:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:44:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T401906)', diff saved to https://phabricator.wikimedia.org/P82655 and previous config saved to /var/cache/conftool/dbconfig/20250905-184445-fceratto.json [18:51:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.531 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:51:51] (03PS1) 10Dzahn: zuul::executor: use profile::docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1185184 (https://phabricator.wikimedia.org/T403847) [18:52:10] (03PS2) 10Dzahn: zuul::executor: use profile::docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1185184 (https://phabricator.wikimedia.org/T403847) [18:52:54] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 
8 Dell switches - https://phabricator.wikimedia.org/T380050#11153429 (10Jclark-ctr) 05Resolved→03Open [18:57:37] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11153435 (10jasmine_) [18:59:17] (03CR) 10Dzahn: [C:03+2] zuul::executor: use profile::docker::engine to install docker [puppet] - 10https://gerrit.wikimedia.org/r/1185184 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:59:23] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11153441 (10wiki_willy) Awesome, thanks so much @jasmine_ ! [18:59:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:59:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P82656 and previous config saved to /var/cache/conftool/dbconfig/20250905-185952-fceratto.json [19:05:16] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint2002.codfw.wmnet - https://phabricator.wikimedia.org/T403855 (10jasmine_) 03NEW [19:05:41] (03CR) 10Cwhite: [C:03+1] nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:05:55] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint2002.codfw.wmnet - https://phabricator.wikimedia.org/T403855#11153466 (10wiki_willy) Thanks @jasmine_ ! 
[19:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:09:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:07] (03CR) 10Dzahn: "@Filippo You were added by the bot due to the rule on https://www.mediawiki.org/wiki/Git/Reviewers#operations/puppet You might want to ed" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:13:04] (03CR) 10Dzahn: "someone needs to merge it, uploader won't be able to" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:13:31] (03CR) 10Dzahn: [C:03+2] nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:15:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P82657 and previous config saved to /var/cache/conftool/dbconfig/20250905-191500-fceratto.json [19:19:31] !log dzahn@cumin2002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: no reason specified, ] [19:19:36] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: no reason specified, ] [19:20:47] !log pooled ulsfo again - Lumen back up - Arelion still working [19:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:56] PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [19:24:44] PROBLEM - 
mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:49] (03CR) 10Dzahn: [C:03+2] "<+icinga-wm> PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:25:28] (03CR) 10Dzahn: [C:03+2] "Error: Could not find any host matching 'frdata2001' (config file '/etc/icinga/objects/nsca_frack.cfg', starting on line 938)" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:25:48] (03PS1) 10Dzahn: Revert "nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001" [puppet] - 10https://gerrit.wikimedia.org/r/1185188 [19:27:49] (03CR) 10Dzahn: [C:03+2] "there are still a bunch of services tied to the host that is removed here, leading to icinga config errors.. 
reverting" [puppet] - 10https://gerrit.wikimedia.org/r/1184796 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [19:28:49] (03CR) 10Dzahn: "starting line 912 there are multiple services tied to the removed host" [puppet] - 10https://gerrit.wikimedia.org/r/1185188 (owner: 10Dzahn) [19:28:55] (03CR) 10Dzahn: [C:03+2] Revert "nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001" [puppet] - 10https://gerrit.wikimedia.org/r/1185188 (owner: 10Dzahn) [19:29:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.584 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:30:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T401906)', diff saved to https://phabricator.wikimedia.org/P82658 and previous config saved to /var/cache/conftool/dbconfig/20250905-193007-fceratto.json [19:30:12] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [19:30:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:30:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [19:30:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T401906)', diff saved to https://phabricator.wikimedia.org/P82659 and previous config saved to /var/cache/conftool/dbconfig/20250905-193047-fceratto.json [19:32:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:32:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P82660 and previous config saved to /var/cache/conftool/dbconfig/20250905-193256-fceratto.json [19:34:52] (03CR) 10Dzahn: [C:03+2] "icinga config has no more errors." [puppet] - 10https://gerrit.wikimedia.org/r/1185188 (owner: 10Dzahn) [19:42:56] RECOVERY - Check correctness of the icinga configuration on alert1002 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [19:45:00] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Fri 03 Oct 2025 07:09:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [19:48:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P82661 and previous config saved to /var/cache/conftool/dbconfig/20250905-194804-fceratto.json [19:52:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:54:47] (03PS2) 10DLynch: Revert "Edit: Split footer lists into columns" [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185192 (https://phabricator.wikimedia.org/T401066) [19:56:45] (03CR) 10Bartosz Dziewoński: [C:03+1] Revert "Edit: Split footer lists into columns" [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1185192 (https://phabricator.wikimedia.org/T401066) (owner: 10DLynch) [19:57:13] (03PS1) 10Jgreen: Redo nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1185193 (https://phabricator.wikimedia.org/T403674) [19:58:32] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1185192 -- context is T401066, are SRE okay with a deployment? (cc: thcipriani dancy andre). I already have someone to deploy. 
[19:58:33] T401066: List of templates used should be presented in multi-column format - https://phabricator.wikimedia.org/T401066 [19:58:49] Hm, better context would have been T403856. [19:58:50] T403856: Searching while previewing an edit hangs Chrome tabs - https://phabricator.wikimedia.org/T403856 [19:59:22] (PS1) Dzahn: zuul::executor: ensure /var/lib/zuul dir exists [puppet] - https://gerrit.wikimedia.org/r/1185196 (https://phabricator.wikimedia.org/T403847) [19:59:43] (CR) CI reject: [V:-1] zuul::executor: ensure /var/lib/zuul dir exists [puppet] - https://gerrit.wikimedia.org/r/1185196 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [19:59:44] Kemayo: That looks like a reasonable thing to emergency deploy [20:00:03] (PS2) Dzahn: zuul::executor: ensure /var/lib/zuul dir exists [puppet] - https://gerrit.wikimedia.org/r/1185196 (https://phabricator.wikimedia.org/T403847) [20:00:20] dancy: Great, I will get it out then. [20:00:32] +1 good luck Kemayo [20:01:00] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1185192 (https://phabricator.wikimedia.org/T401066) (owner: DLynch) [20:01:45] Kemayo: please do get an OK from both SRE and releng when you need to do this :) but yes go ahead [20:02:45] (CR) Dzahn: [C:+2] zuul::executor: ensure /var/lib/zuul dir exists [puppet] - https://gerrit.wikimedia.org/r/1185196 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [20:03:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P82662 and previous config saved to /var/cache/conftool/dbconfig/20250905-200311-fceratto.json [20:03:57] rzl: I guess maybe the template on the emergency deploys page needs an update to make sure all the right people get pinged? 
[20:04:29] it does say "Get positive confirmation from SRE before deployment by messaging the SRE's listed as SREs on call in #wikimedia-operations" but possibly not in a sufficiently visible way :) [20:05:03] the template is fine, it's just that getting an OK from one of the two teams involved doesn't mean you're ready to go [20:05:09] oh, I updated that text not too long ago, but not the IRC template [20:05:26] I figured the template-message should cover everything, and didn't consider that it didn't actually include the SREs on call. [20:05:28] * thcipriani updates [20:08:10] template updated, thanks for flagging that Kemayo and thanks for surfacing that rzl [20:09:11] thanks thcipriani <3 [20:09:35] Updates appreciated! [20:16:25] RESOLVED: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:16:58] (Merged) jenkins-bot: Revert "Edit: Split footer lists into columns" [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1185192 (https://phabricator.wikimedia.org/T401066) (owner: DLynch) [20:17:17] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1185192|Revert "Edit: Split footer lists into columns" (T401066 T403856)]] [20:17:22] T401066: List of templates used should be presented in multi-column format - https://phabricator.wikimedia.org/T401066 [20:17:22] T403856: Searching while previewing an edit hangs Chrome tabs - https://phabricator.wikimedia.org/T403856 [20:18:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T401906)', diff saved to https://phabricator.wikimedia.org/P82663 and previous config saved to /var/cache/conftool/dbconfig/20250905-201818-fceratto.json [20:18:22] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [20:18:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [20:23:18] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1185192|Revert "Edit: Split footer lists into columns" (T401066 T403856)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:23:23] T401066: List of templates used should be presented in multi-column format - https://phabricator.wikimedia.org/T401066 [20:23:24] T403856: Searching while previewing an edit hangs Chrome tabs - https://phabricator.wikimedia.org/T403856 [20:24:03] (PS1) Dzahn: zuul: load apache mod_proxy_wstunnel, add rewrite rules [puppet] - https://gerrit.wikimedia.org/r/1185201 (https://phabricator.wikimedia.org/T395938) [20:24:37] !log kemayo@deploy1003 kemayo: Continuing with sync [20:31:11] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#11153753 (VRiley-WMF) Open→Resolved @Jclark-ctr verified it was removed and the script has been run [20:32:48] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185192|Revert "Edit: Split footer lists into columns" (T401066 T403856)]] (duration: 15m 31s) [20:32:54] T401066: List of templates used should be presented in multi-column format - https://phabricator.wikimedia.org/T401066 [20:32:54] T403856: Searching while previewing an edit hangs Chrome tabs - https://phabricator.wikimedia.org/T403856 [20:33:13] All done, and it doesn't seem to have broken anything. 
👍🏻 [20:35:58] <3 [20:38:56] (CR) Dzahn: [C:+2] zuul: load apache mod_proxy_wstunnel, add rewrite rules [puppet] - https://gerrit.wikimedia.org/r/1185201 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [20:41:20] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:46:04] !log jclark@cumin1002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:48:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:51:50] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11153781 (Jdlrobson-WMF) Hey @krinkle! In general, I'm fairly confident you have thought of everything but my time here has shown me there is always so... 
[20:59:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:02:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [21:02:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11153806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [21:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:18:04] jclark@cumin1002 reimage (PID 174891) is awaiting input [21:18:22] (03CR) 10Dwisehaupt: [C:03+1] Redo nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1185193 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [21:27:20] (03CR) 10Dzahn: [C:03+2] Redo nsca_frack_cfg.erb add frmx2002/frdata2002, remove frmx2001/frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/1185193 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [21:27:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402925)', diff saved to https://phabricator.wikimedia.org/P82664 and previous config saved to /var/cache/conftool/dbconfig/20250905-212721-ladsgroup.json [21:27:26] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: 
lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:35:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11153885 (10VRiley-WMF) a:03VRiley-WMF [21:36:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11153886 (10VRiley-WMF) This is completed [21:36:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T400442#11153887 (10VRiley-WMF) [21:42:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P82665 and previous config saved to /var/cache/conftool/dbconfig/20250905-214229-ladsgroup.json [21:43:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:47:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [21:47:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11153912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm executed with errors: - dse-k8s-worker10... 
[21:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:57:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P82666 and previous config saved to /var/cache/conftool/dbconfig/20250905-215736-ladsgroup.json [22:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:46] (03CR) 10Dzahn: [C:03+2] "no icinga errors or warnings. looks all good" [puppet] - 10https://gerrit.wikimedia.org/r/1185193 (https://phabricator.wikimedia.org/T403674) (owner: 10Jgreen) [22:06:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.798 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T402925)', diff saved to https://phabricator.wikimedia.org/P82667 and previous config saved to /var/cache/conftool/dbconfig/20250905-221244-ladsgroup.json [22:12:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [22:13:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance [22:18:12] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [22:21:40] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1049 - vriley@cumin1003" [22:22:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt 
es1049 - vriley@cumin1003" [22:22:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:22:21] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1049 [22:23:38] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1049 [22:24:21] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:29:52] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a5-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403867 (10phaultfinder) 03NEW [22:32:31] vriley@cumin1003 provision (PID 1203425) is awaiting input [22:33:57] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:27] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for musikanimal - https://phabricator.wikimedia.org/T403868 (10thcipriani) 03NEW [22:50:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:51:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11153977 (10VRiley-WMF) [23:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:25:06] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1049.eqiad.wmnet with OS bookworm [23:25:16] 10ops-eqiad, 06SRE, 
06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11154034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1049.eqiad.wmnet with OS bookworm [23:30:08] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184951 (owner: 10TrainBranchBot) [23:31:24] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [23:37:31] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1050 - vriley@cumin1003" [23:37:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1050 - vriley@cumin1003" [23:37:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:37:56] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1050 [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1185225 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1185225 (owner: 10TrainBranchBot) [23:39:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1050 [23:40:11] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:45:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11154046 (10VRiley-WMF) [23:52:27] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - 
https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:53:25] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1049.eqiad.wmnet with reason: host reimage [23:55:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1185225 (owner: 10TrainBranchBot) [23:58:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1049.eqiad.wmnet with reason: host reimage [23:59:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring