[00:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:08:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187572 [00:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187572 (owner: 10TrainBranchBot) [00:11:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:25:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:30:29] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187572 (owner: 10TrainBranchBot) [00:32:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:58] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:58] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [01:11:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.493 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:30:13] (03CR) 10Ssingh: [C:03+2] "This should now be fixed. For posterity: add a service::catalog override with snakeoil data to make the LVS puppetization bits happy." [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:55] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown 
[01:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:39:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:46:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.451 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:30:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:35:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:50:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:51:14] (03PS13) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [03:51:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:55:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.944 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:47] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [04:01:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.656 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:20:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:30:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:43:58] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:58] FIRING: [2x] 
OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [05:30:25] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update [05:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:38:15] (03CR) 10Ayounsi: "Overall lgtm, some minor comments." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1187092 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [05:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250912T0600) [06:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:22:07] 06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11175061 (10JAllemandou) >>! In T402612#11174218, @CDanis wrote: > I can find time to reimplement the 'true' logic (or maybe a very sligh... [06:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:31:48] 10ops-eqiad, 06DC-Ops: Take kafka-jumbo100[7-9] out of service, ready for decom - https://phabricator.wikimedia.org/T397447#11175075 (10brouberol) a:05brouberol→03None [06:32:45] 10ops-eqiad, 06DC-Ops: Take kafka-jumbo100[7-9] out of service, ready for decom - https://phabricator.wikimedia.org/T397447#11175104 (10brouberol) @wiki_willy Thanks! I created T404413 for the kafka hosts. 
[06:37:52] The alert 'PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled' is from the fact that we do not have much behind ingress. We are deploying a small application to handle this. cc sukhe [06:40:59] (03PS1) 10Aklapper: phabricator weekly changes email: List User Project tags with issues [puppet] - 10https://gerrit.wikimedia.org/r/1187657 (https://phabricator.wikimedia.org/T404411) [06:59:53] (03CR) 10Elukey: [C:03+2] role::deployment_server::kubernetes: upgrade the default statsd image [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [06:59:56] (03PS1) 10Muehlenhoff: imposm-initial-import: Add the reindex step to the script [puppet] - 10https://gerrit.wikimedia.org/r/1187660 (https://phabricator.wikimedia.org/T381565) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250912T0700) [07:00:58] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Add the reindex step to the script [puppet] - 10https://gerrit.wikimedia.org/r/1187660 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:19:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [07:23:07] (03PS1) 10Muehlenhoff: maps1011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187666 (https://phabricator.wikimedia.org/T381565) [07:26:04] akosiaris: We discovered a data loss bug this morning, and I would like to deploy a revert patch to fix it today. 
Here's a description of the symptom: https://phabricator.wikimedia.org/T356471#11175220 [07:27:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187666 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:27:29] Here's the emergency rollback, which has been reviewed by my team: https://gerrit.wikimedia.org/r/1187665 [07:30:27] (03PS1) 10Elukey: preseed: fix dse-k8s-worker1014's partman config [puppet] - 10https://gerrit.wikimedia.org/r/1187669 (https://phabricator.wikimedia.org/T394357) [07:32:22] (03CR) 10Brouberol: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 (owner: 10Brouberol) [07:33:43] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [07:34:00] (03CR) 10Elukey: [C:03+2] preseed: fix dse-k8s-worker1014's partman config [puppet] - 10https://gerrit.wikimedia.org/r/1187669 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [07:34:32] (03PS1) 10Huei Tan: AX: Enable entry-points on 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187670 (https://phabricator.wikimedia.org/T404420) [07:37:04] !log installing libcpanel-json-xs-perl security updates [07:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [07:40:52] (03CR) 10KCVelaga: "Wouldn't this enable the entry points for all users?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187670 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:41:10] (03PS18) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [07:41:13] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (0314 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [07:44:36] (03PS1) 10Brouberol: Update tag for flink 1.20.2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187673 (https://phabricator.wikimedia.org/T400600) [07:46:41] Amir1: please see my message above, we're hoping to do an emergency rollback today. [07:47:13] (03PS2) 10Brouberol: Update tag for flink 1.20.2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187673 (https://phabricator.wikimedia.org/T400600) [07:52:58] (03PS1) 10Muehlenhoff: maps2011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187675 (https://phabricator.wikimedia.org/T381565) [07:53:15] !log temporary upgrading haproxykafka on cp7001 to a test version to check for possible encoding issues (T401383) [07:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:21] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383 [07:54:19] (03CR) 10DCausse: [C:03+1] Update tag for flink 1.20.2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187673 (https://phabricator.wikimedia.org/T400600) (owner: 10Brouberol) [07:54:39] (03Abandoned) 10Huei Tan: AX: Enable entry-points on 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187670 (https://phabricator.wikimedia.org/T404420) (owner: 
10Huei Tan) [07:55:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:55:47] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [07:55:51] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187676 [07:56:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:57:15] (03CR) 10Vgutierrez: P:cache:haproxy add datacenter information to provenance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:57:36] (03CR) 10Brouberol: [C:03+2] Update tag for flink 1.20.2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187673 (https://phabricator.wikimedia.org/T400600) (owner: 10Brouberol) [07:57:39] (03CR) 10Brouberol: [V:03+2 C:03+2] Update tag for flink 1.20.2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187673 (https://phabricator.wikimedia.org/T400600) (owner: 10Brouberol) [07:59:26] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1014.eqiad.wmnet with reason: host reimage [08:03:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1014.eqiad.wmnet with reason: host reimage [08:04:10] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187676 (owner: 10Elukey) [08:05:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.190 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:05:44] (03PS1) 10Elukey: Upstream release v11.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1187733 [08:05:56] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v11.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1187733 (owner: 10Elukey) [08:06:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.507 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:12:41] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11175296 (10elukey) @Jclark-ctr I fixed the partman config with https://gerrit.wikimedia.org/r/1187669, and reimaged the host to bookworm. We should be good to close the taks! I'll let you do it... [08:13:43] (03PS19) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:01] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:15:24] (03CR) 10Elukey: [C:03+1] maps1011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187666 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:16:13] !log uploaded spicerack_11.5.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [08:16:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:29] (03CR) 10Volans: "reply inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:18:28] (03PS20) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:18:29] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:19:32] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:21:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:22:16] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:22:36] elukey@cumin1003 reimage (PID 2216980) is awaiting input [08:22:42] (03PS1) 10Muehlenhoff: ssh: Disable X11 for the new-style sshd.d template [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) [08:23:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187675 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:25:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:25:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: 10Muehlenhoff) [08:26:27] RECOVERY - mailman archives on lists1004 is OK: HTTP 
OK: HTTP/1.1 200 OK - 54829 bytes in 1.048 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:29:51] awight: sorry, missed your ping in the noise. Go ahead, I 'd say [08:30:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [08:30:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm [08:32:50] akosiaris: ty! will do it now. [08:32:56] (03PS3) 10Effie Mouzeli: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [08:33:32] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:34:05] (03PS4) 10Effie Mouzeli: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [08:34:34] (03Abandoned) 10Effie Mouzeli: hcaptcha: define timeouts for hcaptcha [puppet] - 10https://gerrit.wikimedia.org/r/1187471 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [08:40:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:42:52] (03PS1) 10Kosta Harlan: hCaptcha: Disable hCaptcha for projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187740 [08:43:07] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [08:43:08] !log btullis@cumin1003 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1233.eqiad.wmnet with OS bullseye [08:43:08] (03CR) 10Kosta Harlan: [C:04-2] "This is just to save a few minutes of time in the event that we need to quickly disable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187740 (owner: 10Kosta Harlan) [08:43:43] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "T398438 - btullis@cumin1003" [08:43:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "T398438 - btullis@cumin1003" [08:43:48] T398438: Decommission or recommission all snapshot and dumpsdata servers - https://phabricator.wikimedia.org/T398438 [08:43:59] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1234.eqiad.wmnet with OS bullseye [08:49:42] (03PS1) 10Awight: Revert "Remove refs from reference lists if there are no references left to them" [extensions/Cite] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187741 (https://phabricator.wikimedia.org/T356471) [08:50:22] (03PS21) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:50:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187741 (https://phabricator.wikimedia.org/T356471) (owner: 10Awight) [09:01:34] (03Merged) 10jenkins-bot: Revert "Remove refs from reference lists if there are no references left 
to them" [extensions/Cite] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1187741 (https://phabricator.wikimedia.org/T356471) (owner: 10Awight) [09:02:07] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1187741|Revert "Remove refs from reference lists if there are no references left to them" (T356471)]] [09:02:12] T356471: The VisualEditor cannot add or remove list-defined references - https://phabricator.wikimedia.org/T356471 [09:03:58] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [09:05:17] (03PS5) 10Slyngshede: P:cache:haproxy add datacenter information to provenance [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) [09:07:52] (03PS2) 10Ssingh: P:haptcha: set PKI for proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) [09:08:23] !log awight@deploy1003 awight: Backport for [[gerrit:1187741|Revert "Remove refs from reference lists if there are no references left to them" (T356471)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:08:27] T356471: The VisualEditor cannot add or remove list-defined references - https://phabricator.wikimedia.org/T356471 [09:08:57] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6906/co" [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:09:04] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1234.eqiad.wmnet with reason: host reimage [09:09:16] (03PS3) 10Effie Mouzeli: P:haptcha: set PKI for proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [09:10:06] !log awight@deploy1003 awight: Continuing with sync [09:11:18] (03CR) 10Vgutierrez: [C:04-2] "we don't need to issue a certificate for proxoid.discovery.wmnet, hcaptcha already has the required certs:" [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [09:13:27] (03CR) 10Slyngshede: P:cache:haproxy add datacenter information to provenance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:14:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1234.eqiad.wmnet with reason: host reimage [09:15:21] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187741|Revert "Remove refs from reference lists if there are no references left to them" (T356471)]] (duration: 13m 14s) [09:15:25] T356471: The VisualEditor cannot add or remove list-defined references - https://phabricator.wikimedia.org/T356471 [09:16:31] akosiaris: All done, thanks! I'm around for at least 3 hours in case anyone reports a problem, but I think this patch was safe. 
[09:30:38] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [09:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:42] btullis@cumin1003 reimage (PID 2225247) is awaiting input [09:33:50] (03PS1) 10Elukey: redfish: increase timeout for Dell's change_user_password request [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187748 (https://phabricator.wikimedia.org/T392851) [09:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:36:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [09:36:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1234.eqiad.wmnet with OS bullseye [09:37:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:31] (03PS1) 10Effie Mouzeli: P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [09:44:18] (03PS2) 10Effie Mouzeli: P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [09:44:33] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [09:46:36] (03CR) 10Elukey: [C:03+2] "The change is trivial and I tested it via spicerack-shell, so I am going to self-merge so I can build the new spicerack relase and test th" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187748 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [09:46:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:48:56] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187753 [09:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:52:02] (03CR) 10Muehlenhoff: [C:03+2] maps1011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187666 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:55:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:56:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:57:00] btullis@cumin1003 reimage (PID 2232956) is awaiting input [09:58:00] (03CR) 10Hnowlan: [C:03+1] imposm-initial-import: Add the reindex step to the script [puppet] - 10https://gerrit.wikimedia.org/r/1187660 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:00:06] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187753 (owner: 10Elukey) [10:00:12] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.961 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.027 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:01:29] (03PS1) 10Elukey: Upstream release v11.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1187754 [10:01:43] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v11.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1187754 (owner: 10Elukey) [10:01:58] (03CR) 10Ladsgroup: [C:03+1] ssh: Disable X11 for the new-style sshd.d template [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: 10Muehlenhoff) [10:02:17] (03CR) 10Majavah: [C:03+1] ssh: Disable X11 for the new-style sshd.d template [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: 10Muehlenhoff) [10:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:04:44] (03CR) 10Ladsgroup: [C:03+1] Add a dummy secret file containing the wikiadmin password [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 (owner: 10Brouberol) [10:06:10] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye [10:06:25] FIRING: SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [10:09:45] (03PS1) 10Filippo Giunchedi: profile: ship Cloud VPS root authorized-keys [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) [10:10:16] PROBLEM - TFTP service on install1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [10:12:19] (03CR) 10Filippo Giunchedi: "The idea is to nuke ssh::userkey from labs/private.git once this is rolled out and remove the !defined here" [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:12:35] (03CR) 10CI reject: [V:04-1] profile: ship Cloud VPS root authorized-keys [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:12:47] !log uploaded spicerack_11.6.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [10:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:16] (03PS2) 10Filippo Giunchedi: profile: ship Cloud VPS root 
authorized-keys [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) [10:14:27] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:14:54] (03CR) 10Majavah: profile: ship Cloud VPS root authorized-keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:15:10] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:16:06] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:17:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:17:12] (03CR) 10Brouberol: [C:03+2] Add a dummy secret file containing the wikiadmin password [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 (owner: 10Brouberol) [10:17:13] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:17:14] (03CR) 10Brouberol: [V:03+2 C:03+2] Add a dummy secret file containing the wikiadmin password [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 (owner: 10Brouberol) [10:17:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:17:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:18:58] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:21:27] (03PS2) 10Phuedx: WikimediaEvents: Disable client-side error logging for certain wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) [10:21:35] (03CR) 10Phuedx: WikimediaEvents: Disable client-side error logging for certain wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) (owner: 10Phuedx) [10:22:01] elukey@cumin2002 provision (PID 3544527) is awaiting input [10:22:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) (owner: 10Phuedx) [10:23:21] !log upgrade spicerack to 0.11.6 to all cumin hosts [10:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:50] (03CR) 10Elukey: [C:03+1] maps2011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187675 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:24:11] (03CR) 10Muehlenhoff: [C:03+2] maps2011: Enable timers for OSM sync and waterlines import [puppet] - 10https://gerrit.wikimedia.org/r/1187675 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:25:29] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye 
[10:25:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [10:26:07] (03PS3) 10Filippo Giunchedi: profile: ship Cloud VPS root authorized-keys [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) [10:26:18] (03CR) 10Filippo Giunchedi: profile: ship Cloud VPS root authorized-keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:26:33] (03PS3) 10Effie Mouzeli: P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [10:27:48] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [10:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:17] (03CR) 10Elukey: [C:03+1] ssh: Disable X11 for the new-style sshd.d template [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: 10Muehlenhoff) [10:32:14] (03CR) 10Majavah: [C:03+1] "LGTM, assuming this profile is also included in Pontoon etc" [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:33:44] (03CR) 10Suzanne Wood: [C:03+1] Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE) [10:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:36:59] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade. [10:39:59] arnaudb@cumin1003 upgrade (PID 2204827) is awaiting input [10:43:25] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye [10:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:55:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [11:00:05] (03PS1) 10Brouberol: global_config: inject wikiadmin_usermame in the mariadb section [puppet] - 10https://gerrit.wikimedia.org/r/1187766 (https://phabricator.wikimedia.org/T404162) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250912T0700) [11:00:05] jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250912T1100). 
[11:00:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1300.eqiad.wmnet [11:01:51] (03PS1) 10Brouberol: mediawiki-dumps-legacy: mount a secret containing the wikiadmin credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187767 (https://phabricator.wikimedia.org/T404162) [11:01:59] (03PS1) 10Brouberol: Rely on a configuration file to provide the credentials to the mysql CLI [dumps] - 10https://gerrit.wikimedia.org/r/1187768 (https://phabricator.wikimedia.org/T404162) [11:02:20] (03CR) 10CI reject: [V:04-1] Rely on a configuration file to provide the credentials to the mysql CLI [dumps] - 10https://gerrit.wikimedia.org/r/1187768 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:03:11] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187766 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:03:20] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: mount a secret containing the wikiadmin credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187767 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:04:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host db1300.eqiad.wmnet [11:05:05] (03CR) 10Hnowlan: "lgtm mostly, some minor cleanup" [cookbooks] - 10https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [11:05:35] (03PS4) 10Muehlenhoff: Make maps2012-2014 replica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) [11:06:07] (03PS2) 10Brouberol: Rely on a configuration file to provide the credentials to the mysql CLI [dumps] - 10https://gerrit.wikimedia.org/r/1187768 (https://phabricator.wikimedia.org/T404162) [11:07:29] (03CR) 10Btullis: [C:03+1] Rely on a configuration file to provide the credentials to the mysql CLI [dumps] - 10https://gerrit.wikimedia.org/r/1187768 
(https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:07:47] (03PS1) 10Vgutierrez: haproxy: Provide an ASCII decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187770 (https://phabricator.wikimedia.org/T401383) [11:08:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:16:34] (03CR) 10Ladsgroup: [C:03+1] global_config: inject wikiadmin_usermame in the mariadb section [puppet] - 10https://gerrit.wikimedia.org/r/1187766 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:16:59] (03CR) 10Brouberol: [C:03+2] global_config: inject wikiadmin_usermame in the mariadb section [puppet] - 10https://gerrit.wikimedia.org/r/1187766 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:17:47] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host db2202.codfw.wmnet [11:18:01] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host db2202.codfw.wmnet [11:18:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:18:18] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: mount a secret containing the wikiadmin credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187767 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:18:21] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host db2202.codfw.wmnet [11:23:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:23:58] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:26:02] (03PS1) 10Jcrespo: mariadb: Upgrade db2202 (test-s1) to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1187773 (https://phabricator.wikimedia.org/T394371) [11:26:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:27:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:28:25] (03CR) 10Brouberol: [C:03+2] Rely on a configuration file to provide the credentials to the mysql CLI [dumps] - 10https://gerrit.wikimedia.org/r/1187768 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:28:51] (03CR) 10Ladsgroup: [C:03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187773 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [11:29:58] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db2202.codfw.wmnet [11:30:23] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2202 (test-s1) to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1187773 (https://phabricator.wikimedia.org/T394371) (owner: 10Jcrespo) [11:31:47] (03PS1) 10Brouberol: mediawiki-dumps-legacy: use a non-opaque Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187774 (https://phabricator.wikimedia.org/T404162) [11:34:36] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: use a non-opaque Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187774 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:35:18] (03PS2) 10Sergio Gimeno: beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [11:35:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:35:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:38:58] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:43:06] (03PS1) 10Btullis: Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) [11:44:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:46:12] (03PS5) 10Effie Mouzeli: P:trafficserver: 
switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [11:46:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:48:08] (03CR) 10Ayounsi: [C:03+1] redfish: increase timeout for Dell's change_user_password request [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187748 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [11:48:23] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye [11:48:42] (03PS1) 10Brouberol: mediawiki-dumps-legacy: use a non-opaque Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) [11:50:29] (03CR) 10Btullis: mediawiki-dumps-legacy: use a non-opaque Secret for mariadb config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:50:59] (03PS2) 10Brouberol: mediawiki-dumps-legacy: use a string Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) [11:51:30] (03PS3) 10Brouberol: mediawiki-dumps-legacy: use a string Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) [11:51:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [11:51:48] (03CR) 10Brouberol: mediawiki-dumps-legacy: use a string Secret for mariadb config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:52:23] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: use a string Secret for mariadb config (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:54:06] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: use a string Secret for mariadb config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187776 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [11:55:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [11:56:07] (03PS2) 10Btullis: Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) [11:56:10] (03PS1) 10Vgutierrez: haproxy: Provide an utf8ps decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) [11:56:30] (03PS22) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:57:28] (03CR) 10Ayounsi: Nokia: Add support for Python config generation and JSON-RPC API (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:57:34] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6907/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [11:58:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:58:03] (03PS6) 10Effie Mouzeli: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [11:58:06] !log brouberol@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:58:43] (03PS3) 10Btullis: Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) [11:59:19] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1160* gradually with 4 steps - Work done [11:59:38] (03PS7) 10Effie Mouzeli: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [11:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:00:05] (03CR) 10Vgutierrez: [C:04-1] "it looks like the fact that you're using it's excluding the VIP address, 10.2.2.12 should be on PCC output as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:00:13] (03PS8) 10Effie Mouzeli: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [12:00:23] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6908/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [12:00:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1167.eqiad.wmnet with reason: Maintenance [12:00:59] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:01:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T402763)', diff saved to https://phabricator.wikimedia.org/P83251 and previous config saved to /var/cache/conftool/dbconfig/20250912-120106-ladsgroup.json [12:01:10] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:01:21] (03CR) 10Effie Mouzeli: "I assumed that we can't see the VIP due to CI" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:02:11] (03PS2) 10Vgutierrez: haproxy: Provide an ASCII decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187770 (https://phabricator.wikimedia.org/T401383) [12:02:11] (03PS2) 10Vgutierrez: haproxy: Provide an utf8ps decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) [12:02:19] (03CR) 10Effie Mouzeli: "I am afraid I do not have a definitive answer" [puppet] - 
10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:02:52] (03PS4) 10Btullis: Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) [12:02:54] (03PS3) 10D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) [12:04:45] (03PS1) 10Effie Mouzeli: P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [12:04:49] (03CR) 10Majavah: [C:04-1] "The generated config line:" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:04:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [12:07:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T402763)', diff saved to https://phabricator.wikimedia.org/P83252 and previous config saved to /var/cache/conftool/dbconfig/20250912-120746-ladsgroup.json [12:07:51] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:08:09] (03CR) 10CI reject: [V:04-1] P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [12:08:49] (03PS2) 10Effie Mouzeli: P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [12:11:12] (03CR) 10Effie Mouzeli: "in https://puppet-compiler.wmflabs.org/output/1187751/7468/urldownloader1004.wikimedia.org/index.html" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:11:47] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [12:13:10] (03PS3) 10Vgutierrez: haproxy: Provide an utf8ps decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) [12:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:30] (03CR) 10Vgutierrez: [C:03+1] "this should be OK, deploy it disabling puppet on A:cp-text and validating that ATS is happy on a single host" [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [12:16:38] (03CR) 10Majavah: [C:04-1] "I'm looking at https://puppet-compiler.wmflabs.org/output/1187751/4952/urldownloader1004.wikimedia.org/index.html (which is the Puppet 7 v" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:21:25] (03PS1) 10Elukey: preseed: fix partman config for dse-k8s-worker2003 [puppet] - 10https://gerrit.wikimedia.org/r/1187788 (https://phabricator.wikimedia.org/T399778) [12:22:11] (03PS1) 10Jclark-ctr: Add dse-k8s-worker2003 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1187789 (https://phabricator.wikimedia.org/T399778) [12:22:11] (03PS2) 10Elukey: preseed: fix partman config for dse-k8s-worker2003 [puppet] - 10https://gerrit.wikimedia.org/r/1187788 (https://phabricator.wikimedia.org/T399778) [12:22:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P83253 
and previous config saved to /var/cache/conftool/dbconfig/20250912-122254-ladsgroup.json [12:23:31] (03CR) 10Vgutierrez: [C:04-1] "nope, the fact doesn't include the VIP, it can be checked on the host checking `sudo -i facter -p` output:" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [12:26:02] (03Abandoned) 10Jclark-ctr: Add dse-k8s-worker2003 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1187789 (https://phabricator.wikimedia.org/T399778) (owner: 10Jclark-ctr) [12:29:33] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update [12:29:45] (03Abandoned) 10Ayounsi: Rancid: use port 2222 for mgmt routers [puppet] - 10https://gerrit.wikimedia.org/r/890402 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:29:47] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move management routers ssh port - https://phabricator.wikimedia.org/T277438#11175906 (10ayounsi) 05Open→03Declined Feel free to re-open if you disagree, but looks like we might not need to get to that heavy port (and tooling) chang... 
[12:29:56] (03Abandoned) 10Ayounsi: Management routers: move ssh port to 2222 [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:34:05] (03CR) 10Effie Mouzeli: [C:03+2] P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [12:37:02] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11175926 (10WMDE-leszek) @KFrancis it is elisha.cohen AT wikimedia DOT de thank you [12:38:02] (03Abandoned) 10Ayounsi: tox: remove python 3.9 and 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1161342 (owner: 10Ayounsi) [12:38:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P83254 and previous config saved to /var/cache/conftool/dbconfig/20250912-123801-ladsgroup.json [12:38:19] !log jforrester Deployed security patch for T404392 [12:43:59] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:50:28] (03CR) 10Fabfur: [C:03+1] "LGTM, just a couple of [nitpicks]" [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [12:53:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1167 (T402763)', diff saved to https://phabricator.wikimedia.org/P83256 and previous config saved to /var/cache/conftool/dbconfig/20250912-125309-ladsgroup.json [12:53:15] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [12:53:25] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:55:12] 06SRE, 10Infrastructure Security: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824#11175985 (10taavi) Anything left to do here? [12:56:43] (03PS1) 10Vgutierrez: trafficserver: Do not leak alt-svc headers from applayer [puppet] - 10https://gerrit.wikimedia.org/r/1187796 [12:57:49] (03PS2) 10Vgutierrez: trafficserver: Do not leak alt-svc headers from applayer [puppet] - 10https://gerrit.wikimedia.org/r/1187796 [12:58:25] (03CR) 10Fabfur: [C:03+1] haproxy: Provide an utf8ps decoder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [12:59:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet with reason: Maintenance [12:59:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83259 and previous config saved to /var/cache/conftool/dbconfig/20250912-125949-ladsgroup.json [12:59:54] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [13:01:10] (03CR) 10Ssingh: [C:03+1] trafficserver: Do not leak alt-svc headers from applayer [puppet] - 10https://gerrit.wikimedia.org/r/1187796 (owner: 10Vgutierrez) [13:02:09] (03CR) 10Elukey: Replace elasticsearch api with python requests (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 
(https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [13:02:18] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Do not leak alt-svc headers from applayer [puppet] - 10https://gerrit.wikimedia.org/r/1187796 (owner: 10Vgutierrez) [13:02:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185948 (owner: 10Jforrester) [13:03:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1187788 (https://phabricator.wikimedia.org/T399778) (owner: 10Elukey) [13:04:27] (03PS1) 10Ssingh: P:trafficserver: use proper syntax for ts.server_response [puppet] - 10https://gerrit.wikimedia.org/r/1187799 [13:04:52] (03PS1) 10Vgutierrez: trafficserver: Fix syntax error on default.lua [puppet] - 10https://gerrit.wikimedia.org/r/1187800 [13:05:06] (03CR) 10Ssingh: [C:03+1] trafficserver: Fix syntax error on default.lua [puppet] - 10https://gerrit.wikimedia.org/r/1187800 (owner: 10Vgutierrez) [13:05:29] (03CR) 10Elukey: [C:03+2] preseed: fix partman config for dse-k8s-worker2003 [puppet] - 10https://gerrit.wikimedia.org/r/1187788 (https://phabricator.wikimedia.org/T399778) (owner: 10Elukey) [13:05:45] (03CR) 10Effie Mouzeli: [C:03+1] trafficserver: Fix syntax error on default.lua [puppet] - 10https://gerrit.wikimedia.org/r/1187800 (owner: 10Vgutierrez) [13:05:45] (03Abandoned) 10Ssingh: P:trafficserver: use proper syntax for ts.server_response [puppet] - 10https://gerrit.wikimedia.org/r/1187799 (owner: 10Ssingh) [13:06:37] (03CR) 10Muehlenhoff: [C:03+2] abstractwiki-rust-web: Bump version to 1.85, rustc-web upgraded over the weekend [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185948 (owner: 10Jforrester) [13:06:39] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] abstractwiki-rust-web: Bump version to 1.85, rustc-web upgraded over the weekend [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1185948 (owner: 10Jforrester) [13:07:25] !log ladsgroup@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db1172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83260 and previous config saved to /var/cache/conftool/dbconfig/20250912-130724-ladsgroup.json [13:07:25] is something wrong with our CI? [13:07:29] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [13:07:33] (03CR) 10Ssingh: "Yeah makes sense given the valid endpoint certs in place already. I will abandon." [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [13:07:35] (03Abandoned) 10Ssingh: P:haptcha: set PKI for proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [13:08:09] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Fix syntax error on default.lua [puppet] - 10https://gerrit.wikimedia.org/r/1187800 (owner: 10Vgutierrez) [13:08:16] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:14] looks fine [13:10:18] ah v6 [13:10:30] moritzm: any ongoing work on that? 
[13:11:42] (03PS1) 10Majavah: hieradata: Replace eqiad1 bastion used in tests [puppet] - 10https://gerrit.wikimedia.org/r/1187803 (https://phabricator.wikimedia.org/T392689) [13:11:43] (03PS1) 10Majavah: hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) [13:12:21] (03PS1) 10Stevemunene: dse-k8s: Define echoserver namespace for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187805 (https://phabricator.wikimedia.org/T404433) [13:12:23] (03PS1) 10Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [13:13:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:16] slyngs: not me, Simon have you been testing something? [13:14:46] (03Abandoned) 10Vgutierrez: haproxy: Provide an ASCII decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187770 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [13:15:38] (03PS4) 10Vgutierrez: haproxy: Provide an utf8ps decoder [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) [13:19:13] (03CR) 10Arnaudb: [C:03+2] Revert^5 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1187373 (owner: 10Arnaudb) [13:20:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:07] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp1004 [dns] - 10https://gerrit.wikimedia.org/r/1187809 [13:21:32] (03CR) 10ArielGlenn: [C:03+1] "Thanks for catching this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187476 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [13:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:46] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1235.eqiad.wmnet with OS bullseye [13:22:28] (03PS5) 10Vgutierrez: haproxy: Provide an utf8ps converter [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) [13:22:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P83261 and previous config saved to /var/cache/conftool/dbconfig/20250912-132231-ladsgroup.json [13:25:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:26:00] (03CR) 10Vgutierrez: haproxy: Provide an utf8ps converter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [13:26:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.559 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:41] (03CR) 10Fabfur: [C:03+1] haproxy: Provide an utf8ps converter [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [13:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 
(Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:06] (03CR) 10Vgutierrez: [C:03+2] haproxy: Provide an utf8ps converter [puppet] - 10https://gerrit.wikimedia.org/r/1187777 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [13:37:22] (03PS4) 10Effie Mouzeli: (WIP)P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [13:37:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P83262 and previous config saved to /var/cache/conftool/dbconfig/20250912-133739-ladsgroup.json [13:39:35] (03CR) 10FNegri: [C:03+1] hieradata: Replace eqiad1 bastion used in tests [puppet] - 10https://gerrit.wikimedia.org/r/1187803 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah) [13:42:25] (03CR) 10Majavah: [C:03+2] hieradata: Replace eqiad1 bastion used in tests [puppet] - 10https://gerrit.wikimedia.org/r/1187803 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah) [13:42:56] (03PS1) 10Arnaudb: gerrit: mod_qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1187810 (https://phabricator.wikimedia.org/T402611) [13:42:56] (03CR) 10Arnaudb: "this reproduces what was tested in 1186512" [puppet] - 10https://gerrit.wikimedia.org/r/1187810 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [13:44:30] (03PS1) 10Bking: opensearch-operator: point to correct operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187815 (https://phabricator.wikimedia.org/T397246) [13:44:48] (03PS2) 10Bking: opensearch-operator: point to correct operator image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1187815 (https://phabricator.wikimedia.org/T397246) [13:47:44] (03CR) 10Bking: [C:03+2] opensearch-operator: point to correct operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187815 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:48:08] (03CR) 10Bking: [C:03+2] "self-merging, as this service is not in production and won't work until corrected" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187815 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:49:15] (03PS3) 10Effie Mouzeli: P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [13:49:17] (03Merged) 10jenkins-bot: opensearch-operator: point to correct operator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187815 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:49:18] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [13:51:43] (03CR) 10Btullis: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [13:52:10] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o to idp1004 [dns] - 10https://gerrit.wikimedia.org/r/1187809 (owner: 10Muehlenhoff) [13:52:14] !log jmm@dns1004 START - running authdns-update [13:52:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T402763)', diff saved to https://phabricator.wikimedia.org/P83264 and previous config saved to 
/var/cache/conftool/dbconfig/20250912-135246-ladsgroup.json [13:52:51] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [13:53:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:53:05] (03CR) 10Btullis: [C:03+1] dse-k8s: Define echoserver namespace for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187805 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [13:53:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T402763)', diff saved to https://phabricator.wikimedia.org/P83265 and previous config saved to /var/cache/conftool/dbconfig/20250912-135309-ladsgroup.json [13:53:20] (03CR) 10Vgutierrez: "ideally traffiserver timeout should be a little bit higher than nginx so ATS doesn't go away before nginx stops trying to get a response" [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [13:53:24] !log jmm@dns1004 END - running authdns-update [13:53:26] (03PS2) 10Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [13:54:54] (03CR) 10CI reject: [V:04-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [13:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:56:16] !log jmm@dns1004 START - running authdns-update [13:56:22] (03PS4) 10Effie 
Mouzeli: P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [13:57:23] !log jmm@dns1004 END - running authdns-update [13:57:54] (03PS5) 10Effie Mouzeli: P:hcatcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [14:02:32] (03CR) 10Effie Mouzeli: "it is 2s lower than ATS, if we are ok, we merge" [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:03:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T402763)', diff saved to https://phabricator.wikimedia.org/P83266 and previous config saved to /var/cache/conftool/dbconfig/20250912-140348-ladsgroup.json [14:03:54] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [14:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:05:07] (03PS5) 10Ssingh: (WIP)P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:05:48] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6910/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:06:40] FIRING: SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:07:45] (03CR) 10Vgutierrez: [C:04-1] (WIP)P:haptcha: only listen to local addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:10:34] (03CR) 10Vgutierrez: [C:03+1] P:hcatcha: set nginx timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:11:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1235.eqiad.wmnet with OS bullseye [14:14:13] (03PS6) 10Ssingh: (WIP)P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:14:53] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6911/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:15:54] (03PS6) 10Effie Mouzeli: P:hcaptcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) [14:15:55] (03CR) 10Ssingh: (WIP)P:haptcha: only listen to local addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:18:38] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: set nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187778 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:18:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P83267 and previous config saved to /var/cache/conftool/dbconfig/20250912-141856-ladsgroup.json [14:18:58] FIRING: 
[3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:19:53] (03PS7) 10Effie Mouzeli: P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [14:19:58] (03CR) 10Effie Mouzeli: [C:03+1] P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:22:29] (03CR) 10JHathaway: [C:03+1] ssh: Disable X11 for the new-style sshd.d template [puppet] - 10https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: 10Muehlenhoff) [14:25:15] (03CR) 10Majavah: [C:03+1] "LGTM, cosmetic nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:26:25] (03PS8) 10Effie Mouzeli: P:haptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [14:26:27] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:27:16] (03CR) 10Ssingh: P:haptcha: only listen to local addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:01] 06SRE, 10Infrastructure Security: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - 
https://phabricator.wikimedia.org/T253824#11176390 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this, there's no Stretch hosts left and even Buster is close... [14:31:27] (03PS9) 10Effie Mouzeli: P:hcaptcha: only listen to local addresses [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) [14:31:36] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for  - https://phabricator.wikimedia.org/T403695#11176396 (10CDobbins) 05Open→03Resolved p:05Triage→03Medium a:03CDobbins [14:34:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P83268 and previous config saved to /var/cache/conftool/dbconfig/20250912-143404-ladsgroup.json [14:34:11] (03CR) 10Effie Mouzeli: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:34:27] (03CR) 10Effie Mouzeli: [C:03+2] "thank you folks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187751 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:34:32] (03CR) 10Brouberol: [C:03+1] dse-k8s: Define echoserver namespace for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187805 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [14:34:32] (03PS1) 10Majavah: openstack: nova: fullstack: Drop --ipv6 flag [puppet] - 10https://gerrit.wikimedia.org/r/1187821 [14:34:32] (03PS1) 10Majavah: openstack: nova: fullstack: Use Trixie image [puppet] - 10https://gerrit.wikimedia.org/r/1187822 [14:34:33] (03PS1) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:35:26] (03CR) 10CI reject: [V:04-1] openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 (owner: 10Majavah) [14:36:50] (03PS2) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:37:26] (03CR) 10CI reject: [V:04-1] openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 (owner: 10Majavah) [14:38:03] (03PS3) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:39:08] (03PS1) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 [14:40:01] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1235.eqiad.wmnet with OS bullseye [14:41:03] (03CR) 10Scott French: "Thanks, Jasmine!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [14:41:09] (03CR) 10Ssingh: P:hcaptcha: add keepalive_timeout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (owner: 10Effie Mouzeli) [14:41:44] (03CR) 10CI reject: [V:04-1] openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 (owner: 10Majavah) [14:43:25] (03PS1) 10Effie Mouzeli: P:haptcha: fix default_server [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) [14:43:41] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:43:56] (03CR) 10Ssingh: [C:03+1] P:haptcha: fix default_server [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:45:14] (03PS4) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:45:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1236.eqiad.wmnet with OS bullseye [14:46:15] (03CR) 10CI reject: [V:04-1] P:haptcha: fix default_server [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:46:25] FIRING: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:50] (03PS2) 10Effie Mouzeli: P:haptcha: fix default_server [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) [14:47:30] (03PS5) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:47:51] FIRING: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:49:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T402763)', diff saved to https://phabricator.wikimedia.org/P83269 and previous config saved to /var/cache/conftool/dbconfig/20250912-144911-ladsgroup.json [14:49:17] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [14:49:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1178.eqiad.wmnet with reason: Maintenance [14:49:33] (03CR) 10Effie Mouzeli: [C:03+2] P:haptcha: fix default_server [puppet] - 10https://gerrit.wikimedia.org/r/1187832 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [14:49:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T402763)', diff saved to https://phabricator.wikimedia.org/P83270 and previous config saved to /var/cache/conftool/dbconfig/20250912-144934-ladsgroup.json [14:50:07] (03CR) 10CI reject: [V:04-1] openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 (owner: 10Majavah) [14:52:22] (03PS6) 10Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1187823 [14:54:51] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449 (10phaultfinder) 03NEW [14:55:16] (03CR) 10Vgutierrez: [C:03+1] P:cache:haproxy add is_datacenter Lua action [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:56:25] RESOLVED: SystemdUnitFailed: nginx.service on 
urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:56:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:56] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:57:02] (03CR) 10Vgutierrez: [C:03+1] "looking good, just some indentation nits" [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:57:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T402763)', diff saved to https://phabricator.wikimedia.org/P83271 and previous config saved to /var/cache/conftool/dbconfig/20250912-145709-ladsgroup.json [14:57:14] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [14:57:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11176523 (10Jhancock.wm) @elukey correct, i hadn't set those up yet. How many of the servers do you want ips setup on? I'm still trying to leave some untouched... [15:01:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11176560 (10elukey) >>! In T392851#11176523, @Jhancock.wm wrote: > @elukey correct, i hadn't set those up yet. How many of the servers do you want ips setup on?... 
[15:02:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:03:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:07:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:10:11] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11176592 (10elukey) I found this very interesting github issue about how Sloth optimizes the calculation of the error budget: https://github.com/slok/sloth/issues/618. I didn't get if this will a problem for... 
[15:12:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P83272 and previous config saved to /var/cache/conftool/dbconfig/20250912-151216-ladsgroup.json [15:15:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.633 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:59] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [15:22:12] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11176615 (10elukey) ==== Error Budget calculations - Calendar ===== We need a calendar window of 90 days corresponding to every SLO quarter as minimal requirement, and possibly a rolling window to base aler... 
[15:27:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P83274 and previous config saved to /var/cache/conftool/dbconfig/20250912-152724-ladsgroup.json [15:28:58] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-f1-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:35:02] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449#11176650 (10phaultfinder) [15:38:28] (03PS1) 10Bking: admin_ng: Allow access to opensearch CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) [15:42:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T402763)', diff saved to https://phabricator.wikimedia.org/P83275 and previous config saved to /var/cache/conftool/dbconfig/20250912-154231-ladsgroup.json [15:42:37] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [15:42:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1192.eqiad.wmnet with reason: Maintenance [15:42:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T402763)', diff saved to https://phabricator.wikimedia.org/P83276 and previous config saved to /var/cache/conftool/dbconfig/20250912-154253-ladsgroup.json [15:43:28] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [15:43:57] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [15:45:38] (03CR) 10CI reject: [V:04-1] admin_ng: Allow access to opensearch CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 
(https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:47:57] (03PS2) 10Bking: admin_ng: Allow access to opensearch CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) [15:50:11] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449#11176739 (10phaultfinder) [15:50:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T402763)', diff saved to https://phabricator.wikimedia.org/P83277 and previous config saved to /var/cache/conftool/dbconfig/20250912-155018-ladsgroup.json [15:50:24] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [15:51:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:52:51] (03CR) 10Herron: [C:03+1] nsca_frack.cfg.erb create hostgroup fundraising-minio adding check-minio [puppet] - 10https://gerrit.wikimedia.org/r/1186566 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [15:53:59] FIRING: [3x] CertAlmostExpired: Certificate for service cloudsw1-c8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:58:58] FIRING: [4x] CertAlmostExpired: Certificate for service cloudsw1-c8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:59:18] (03CR) 10BCornwall: [V:03+1 C:03+2] acme-chief: Move clean-stale-certs to file [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [16:01:14] (03CR) 10Btullis: admin_ng: Allow access to opensearch CRDs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:01:39] 
!log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [16:03:58] FIRING: [5x] CertAlmostExpired: Certificate for service cloudsw1-c8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:05:13] (03PS4) 10Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - 10https://gerrit.wikimedia.org/r/1186649 [16:05:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P83278 and previous config saved to /var/cache/conftool/dbconfig/20250912-160526-ladsgroup.json [16:06:19] (03CR) 10Btullis: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [16:07:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott) [16:08:58] FIRING: [6x] CertAlmostExpired: Certificate for service cloudsw1-c8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:09:44] 06SRE, 06Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11176819 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03BCornwall [16:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:15:12] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187854 [16:15:41] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1236.eqiad.wmnet with OS bullseye [16:20:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.538 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P83279 and previous config saved to /var/cache/conftool/dbconfig/20250912-162033-ladsgroup.json [16:20:44] (03PS1) 10Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) [16:21:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.699 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T402763)', diff saved to https://phabricator.wikimedia.org/P83280 and previous config saved to /var/cache/conftool/dbconfig/20250912-163541-ladsgroup.json [16:35:47] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [16:35:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1203.eqiad.wmnet with reason: Maintenance [16:36:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83281 and previous config saved to 
/var/cache/conftool/dbconfig/20250912-163605-ladsgroup.json [16:38:53] !log Manually running clean-stale-certs.service on acmechief2002 - T399419 [16:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:58] T399419: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08 - https://phabricator.wikimedia.org/T399419 [16:42:02] (03PS1) 10BCornwall: Move SPDX identifier below shebang [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) [16:43:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83283 and previous config saved to /var/cache/conftool/dbconfig/20250912-164324-ladsgroup.json [16:43:29] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [16:47:35] (03PS2) 10BCornwall: acme-chief: Fixes for cert cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) [16:48:04] RECOVERY - Check unit status of clean-stale-certs on acmechief2002 is OK: OK: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:51:17] (03PS3) 10BCornwall: acme-chief: Fixes for cert cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) [16:55:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:55:12] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6912/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [16:58:22] (03CR) 10Ssingh: 
[C:03+1] acme-chief: Fixes for cert cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [16:58:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P83284 and previous config saved to /var/cache/conftool/dbconfig/20250912-165832-ladsgroup.json [17:00:06] (03PS1) 10Dzahn: add 'tok' language code - Toki Pona [dns] - 10https://gerrit.wikimedia.org/r/1187864 (https://phabricator.wikimedia.org/T404457) [17:02:39] (03PS3) 10Bking: admin_ng: Allow access to opensearch custom resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) [17:02:51] (03CR) 10Bking: admin_ng: Allow access to opensearch custom resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:05:57] (03CR) 10BCornwall: [V:03+1 C:03+2] acme-chief: Fixes for cert cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/1187861 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [17:13:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P83285 and previous config saved to /var/cache/conftool/dbconfig/20250912-171339-ladsgroup.json [17:15:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.449 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1203 (T402763)', diff saved to https://phabricator.wikimedia.org/P83286 and previous config saved to /var/cache/conftool/dbconfig/20250912-172847-ladsgroup.json [17:28:54] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:29:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1209.eqiad.wmnet with reason: Maintenance [17:29:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1209 (T402763)', diff saved to https://phabricator.wikimedia.org/P83287 and previous config saved to /var/cache/conftool/dbconfig/20250912-172909-ladsgroup.json [17:31:13] (03PS1) 10Jasmine: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) [17:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:49] (03CR) 10Bking: "It's my understanding based on https://wikitech.wikimedia.org/wiki/Envoy#Services_Proxy that clients get TLS "for free" when pointing to t" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [17:35:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 
9234 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T402763)', diff saved to https://phabricator.wikimedia.org/P83288 and previous config saved to /var/cache/conftool/dbconfig/20250912-173542-ladsgroup.json [17:35:47] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [17:36:57] (03CR) 10Btullis: [C:03+1] admin_ng: Allow access to opensearch custom resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:41:10] (03CR) 10CI reject: [V:04-1] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [17:43:47] (03PS8) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [17:44:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449#11177196 (10phaultfinder) [17:46:44] (03CR) 10Bking: [C:03+2] admin_ng: Allow access to opensearch custom resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187847 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:48:28] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:48:53] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:49:04] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [17:49:22] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:50:38] (03CR) 10Scott French: "Thanks, Jasmine!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [17:50:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P83289 and previous config saved to /var/cache/conftool/dbconfig/20250912-175049-ladsgroup.json [17:55:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:57:26] (03PS9) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [17:58:03] (03PS10) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [17:59:12] (03PS11) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [18:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:05:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P83290 and previous config saved to /var/cache/conftool/dbconfig/20250912-180557-ladsgroup.json [18:06:40] FIRING: SystemdUnitFailed: imposm.service on maps1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:09:42] (03PS1) 10Krinkle: varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML [puppet] - 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T403510) [18:09:43] (03PS1) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [18:11:41] (03CR) 10Dzahn: [C:03+2] add 'tok' language code - Toki Pona [dns] - 10https://gerrit.wikimedia.org/r/1187864 (https://phabricator.wikimedia.org/T404457) (owner: 10Dzahn) [18:11:58] !log dzahn@dns1004 START - running authdns-update [18:13:11] !log dzahn@dns1004 END - running authdns-update [18:14:23] !log DNS - added new project language 'tok' (tok.wikipedia.org) (Toki Pona) https://en.wikipedia.org/wiki/Toki_Pona - T404457 [18:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:32] T404457: Create Wikipedia Toki Pona - https://phabricator.wikimedia.org/T404457 [18:18:58] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:20:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T402763)', diff saved to https://phabricator.wikimedia.org/P83291 and previous config saved to /var/cache/conftool/dbconfig/20250912-182104-ladsgroup.json [18:21:10] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 
[18:21:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:21:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T402763)', diff saved to https://phabricator.wikimedia.org/P83292 and previous config saved to /var/cache/conftool/dbconfig/20250912-182126-ladsgroup.json [18:25:18] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11177276 (10KFrancis) Hi all, the NDA has been sent out for signatures. I'll confirm when it's complete. [18:27:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T402763)', diff saved to https://phabricator.wikimedia.org/P83293 and previous config saved to /var/cache/conftool/dbconfig/20250912-182752-ladsgroup.json [18:27:57] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [18:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:43:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P83294 and previous config saved to /var/cache/conftool/dbconfig/20250912-184259-ladsgroup.json [18:44:40] (03PS2) 10Jasmine: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) [18:45:04] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 
1.803 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:46:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:47:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:50:24] (03PS1) 10Dzahn: peopleweb: make people2004 the new rsync source [puppet] - 10https://gerrit.wikimedia.org/r/1187884 (https://phabricator.wikimedia.org/T402596) [18:50:45] (03CR) 10CI reject: [V:04-1] peopleweb: make people2004 the new rsync source [puppet] - 10https://gerrit.wikimedia.org/r/1187884 (https://phabricator.wikimedia.org/T402596) (owner: 10Dzahn) [18:53:02] (03CR) 10CI reject: [V:04-1] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [18:54:03] (03PS1) 10Dzahn: switch people service aliases in eqiad and codfw to new trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1187885 (https://phabricator.wikimedia.org/T402596) [18:56:11] (03PS2) 10Dzahn: switch people service aliases in eqiad and codfw to new trixie hosts [dns] - 10https://gerrit.wikimedia.org/r/1187885 (https://phabricator.wikimedia.org/T402596) [18:56:30] (03PS2) 10Dzahn: peopleweb: make people2004 the new rsync source [puppet] - 
10https://gerrit.wikimedia.org/r/1187884 (https://phabricator.wikimedia.org/T402596) [18:56:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:58:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P83295 and previous config saved to /var/cache/conftool/dbconfig/20250912-185807-ladsgroup.json [18:58:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:03:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:04:15] (03PS4) 10Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) [19:10:56] 10ops-codfw, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T404472 (10phaultfinder) 03NEW [19:13:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T402763)', diff saved to https://phabricator.wikimedia.org/P83296 and previous config saved to /var/cache/conftool/dbconfig/20250912-191314-ladsgroup.json [19:13:20] T402763: Drop rc_new from 
recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:13:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1226.eqiad.wmnet with reason: Maintenance [19:13:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T402763)', diff saved to https://phabricator.wikimedia.org/P83297 and previous config saved to /var/cache/conftool/dbconfig/20250912-191338-ladsgroup.json [19:15:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.558 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T402763)', diff saved to https://phabricator.wikimedia.org/P83298 and previous config saved to /var/cache/conftool/dbconfig/20250912-192015-ladsgroup.json [19:20:21] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [19:23:34] (03CR) 10Andrea Denisse: "Hi folks, I updated the Wikitech documentation and included some examples, I'd greatly appreciate your feedback on both the patch and the " [puppet] - 10https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: 10Andrea Denisse) [19:23:58] (03PS2) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [19:26:36] (03PS3) 10Jasmine: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) [19:28:07] (03PS2) 10Krinkle: varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML [puppet] 
- 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T397267) [19:28:09] (03PS3) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [19:31:11] (03PS4) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [19:35:17] (03CR) 10CI reject: [V:04-1] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [19:35:22] (03PS3) 10Krinkle: varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML [puppet] - 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T397267) [19:35:22] (03PS5) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [19:35:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P83299 and previous config saved to /var/cache/conftool/dbconfig/20250912-193523-ladsgroup.json [19:38:02] (03PS4) 10Krinkle: varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML [puppet] - 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T397267) [19:38:03] (03PS6) 10Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [19:41:20] (03PS1) 10Majavah: P:toolforge::proxy: Limit in-flight connections per tool [puppet] - 10https://gerrit.wikimedia.org/r/1187892 (https://phabricator.wikimedia.org/T404471) [19:42:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6920/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187892 (https://phabricator.wikimedia.org/T404471) (owner: 10Majavah) [19:43:14] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1187892 (https://phabricator.wikimedia.org/T404471) (owner: 10Majavah) [19:45:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:46:34] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::proxy: Limit in-flight connections per tool [puppet] - 10https://gerrit.wikimedia.org/r/1187892 (https://phabricator.wikimedia.org/T404471) (owner: 10Majavah) [19:50:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P83300 and previous config saved to /var/cache/conftool/dbconfig/20250912-195030-ladsgroup.json [19:52:37] (03PS1) 10Majavah: P:toolforge::proxy: Allow throttled tools to load error page assets [puppet] - 10https://gerrit.wikimedia.org/r/1187894 (https://phabricator.wikimedia.org/T404471) [19:53:50] (03CR) 10David Caro: [C:03+1] P:toolforge::proxy: Allow throttled tools to load error page assets [puppet] - 10https://gerrit.wikimedia.org/r/1187894 (https://phabricator.wikimedia.org/T404471) (owner: 10Majavah) [19:55:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:55:23] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Allow throttled tools to load error page assets [puppet] - 10https://gerrit.wikimedia.org/r/1187894 (https://phabricator.wikimedia.org/T404471) (owner: 10Majavah) [19:57:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:58:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:02:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:04:52] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:52] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T402763)', diff saved to https://phabricator.wikimedia.org/P83301 and previous config saved to /var/cache/conftool/dbconfig/20250912-200538-ladsgroup.json [20:05:43] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [20:05:48] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [20:08:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:08:58] FIRING: [6x] CertAlmostExpired: Certificate for service cloudsw1-c8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:10:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:11:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:13:40] RESOLVED: [3x] BFDdown: 
BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:13:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:50] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1160* gradually with 4 steps - Work done [20:24:56] (PS1) SD0001: maintain-views: fix filtering for actor view [puppet] - https://gerrit.wikimedia.org/r/1187896 (https://phabricator.wikimedia.org/T404473) [20:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:33:15] (PS10) Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:34:38] (CR) CI reject: [V:-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: Bking) [20:35:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:36:45] (CR) Dzahn: 
[C:+2] zuul::executor: create systemd service [puppet] - https://gerrit.wikimedia.org/r/1187531 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [20:42:16] (PS1) Dzahn: zuul::executor: fix name of systemd service template [puppet] - https://gerrit.wikimedia.org/r/1187899 (https://phabricator.wikimedia.org/T403847) [20:43:10] (PS2) Dzahn: zuul::executor: fix name of systemd service template [puppet] - https://gerrit.wikimedia.org/r/1187899 (https://phabricator.wikimedia.org/T403847) [20:44:04] (PS1) Bking: opensearch-operator: remove unnecessary ClusterRoles from chart [deployment-charts] - https://gerrit.wikimedia.org/r/1187900 (https://phabricator.wikimedia.org/T397246) [20:46:03] (PS2) Bking: opensearch-operator: remove unnecessary ClusterRoles from chart [deployment-charts] - https://gerrit.wikimedia.org/r/1187900 (https://phabricator.wikimedia.org/T397246) [20:46:30] (CR) Dzahn: [C:+2] zuul::executor: fix name of systemd service template [puppet] - https://gerrit.wikimedia.org/r/1187899 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [20:51:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:28] (PS1) Dzahn: zuul::executor: fix service description and path in systemd unit [puppet] - https://gerrit.wikimedia.org/r/1187901 (https://phabricator.wikimedia.org/T403847) [20:55:50] (CR) CI reject: [V:-1] zuul::executor: fix service description and path in systemd unit [puppet] - https://gerrit.wikimedia.org/r/1187901 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [20:58:29] (PS2) Dzahn: zuul::executor: fix service description and path in systemd unit [puppet] - https://gerrit.wikimedia.org/r/1187901 (https://phabricator.wikimedia.org/T403847) [20:58:49] (CR) Dzahn: [C:+2] zuul::executor: fix service description and path in systemd unit 
[puppet] - https://gerrit.wikimedia.org/r/1187901 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn) [21:05:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:19:41] (CR) Btullis: [C:+1] opensearch-operator: remove unnecessary ClusterRoles from chart [deployment-charts] - https://gerrit.wikimedia.org/r/1187900 (https://phabricator.wikimedia.org/T397246) (owner: Bking) [21:25:04] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on zuul2002.codfw.wmnet with reason: in setup [21:25:29] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: in setup [21:26:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:42] ops-codfw, DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T404480 (phaultfinder) NEW [21:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:50:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.318 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:55:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:04:06] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:05:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:06:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 1.439 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:06:40] FIRING: SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:59] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:25:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.177 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:26:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:55:13] (PS7) Krinkle: varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) [22:56:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:01:19] (CR) Krinkle: [C:-1] "This is meant to fail the test added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187875/ specifically `expect req.http.X-Subd" [puppet] - https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) (owner: Krinkle) [23:01:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:16:44] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T403965#11178001 (Papaul) Open→Resolved [23:17:41] ops-codfw, SRE, DC-Ops: codfw: document SCS ports in Netbox - https://phabricator.wikimedia.org/T403634#11178004 (Papaul) Open→Resolved Complete [23:19:35] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11178007 (Papaul) Juniper shipped out a new PEM to replace with PEM0 and see if that will fix the issue. [23:25:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:25:43] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11178012 (vaughnwalters) I made a note of this when testing the campaign events extension T404244#11177992, but also wanted to bring this up here becaus... 
[23:30:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:35:23] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11178019 (Papaul) @VRiley-WMF the issue is that es1056 is missing in the this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172182/1/modules/profile/data/profil...