[00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:06:40] FIRING: [2x] SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:26:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:49:14] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:59:14] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:01:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:06:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:24] ops-eqiad, DC-Ops: Unresponsive management for ms-be1068.mgmt:22 - https://phabricator.wikimedia.org/T404535 (phaultfinder) NEW
[01:20:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:55:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:56:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:00:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.566 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 7.666 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:04:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:45:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.770 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:46:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.861 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:05:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:23:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:28:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from miscweb.discovery.wmnet in magru #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=magru&var-cluster=text&var-origin=miscweb.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:33:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from miscweb.discovery.wmnet in magru #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=magru&var-cluster=text&var-origin=miscweb.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:05:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:06:40] FIRING: [2x] SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:20:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:49:14] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:50:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.545 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:54:14] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:59:14] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.640 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:33:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:36:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.590 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:59:15] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:04:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:15:42] I think i broke something, and then I fixed it
[06:20:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:25:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:51:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:39] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11179665 (tstarling) My suspicion is that it's slow because it's inefficient. That's why I asked for stack traces. P...
[06:55:42] !log restarted atftpd on install1004
[06:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:15] RECOVERY - TFTP service on install1004 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[06:56:29] !log reindex gis database on maps1011 following initial OSM import T381565
[06:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:33] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T0700). nyaa~
[07:00:05] phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:25] FIRING: [3x] SystemdUnitFailed: imposm.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:10:18] the last gerrit update in here from wikibugs is from friday, I'll have a look and possibly restart at least the gerrit job
[07:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:15:14] ^ imposm is expected, the host is in setup, I'll extend downtime for a day
[07:15:20] k
[07:16:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:17:07] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging jwheeler out of all services on: 2418 hosts
[07:18:34] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1011.eqiad.wmnet with reason: in setup
[07:19:45] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1183100 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[07:19:53] ok now is back working
[07:23:25] (PS1) Slyngshede: data.yaml offboarding for jly [puppet] - https://gerrit.wikimedia.org/r/1188139
[07:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:26:37] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1188139 (owner: Slyngshede)
[07:27:05] (CR) Slyngshede: [C:+2] data.yaml offboarding for jly [puppet] - https://gerrit.wikimedia.org/r/1188139 (owner: Slyngshede)
[07:28:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: Joely Rooke WMDE)
[07:34:19] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11179746 (ECohen_WMDE) >>! In T404359#11174138, @CDobbins wrote: > @ECohen_WMDE, how do you want to do the public key confirmation? One of the most common methods is to put your pubkey...
[07:35:28] (CR) Fabfur: [C:+1] P:cache:haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:37:09] (CR) Fabfur: [C:+1] P:cache:haproxy add is_datacenter Lua action [puppet] - https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:39:39] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jly out of all services on: 2418 hosts
[07:44:44] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jly out of all services on: 2418 hosts
[07:46:29] o/ Sorry I'm late
[07:46:32] jouncebot now
[07:46:32] For the next 0 hour(s) and 13 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T0700)
[07:46:38] jouncebot next
[07:46:38] In 2 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1000)
[07:50:25] (CR) Filippo Giunchedi: wmcs: port ::instance to firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[07:52:05] (CR) Stevemunene: [C:+2] dse-k8s: Define echoserver namespace for dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187805 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[07:53:50] Any objections to me deploying the NOP config change that I had scheduled? It's a little late in the window
[07:59:56] (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) (owner: Phuedx)
[08:00:19] (Merged) jenkins-bot: dse-k8s: Define echoserver namespace for dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187805 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[08:00:55] (Merged) jenkins-bot: WikimediaEvents: Disable client-side error logging for certain wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) (owner: Phuedx)
[08:00:58] (PS3) Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433)
[08:02:49] (CR) CI reject: [V:-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene)
[08:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:03:56] scap apply-patches failed: https://spiderpig.wikimedia.org/jobs/549
[08:04:03] Checking the state of the deployment host
[08:04:28] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[08:07:21] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[08:11:51] (CR) Muehlenhoff: [C:+2] imposm-initial-import: Add the reindex step to the script [puppet] - https://gerrit.wikimedia.org/r/1187660 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:14:14] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[08:15:53] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[08:16:51] (CR) Muehlenhoff: [C:+2] ssh: Disable X11 for the new-style sshd.d template [puppet] - https://gerrit.wikimedia.org/r/1187737 (https://phabricator.wikimedia.org/T400478) (owner: Muehlenhoff)
[08:17:51] I think the deployment host is in the correct state but the next deployment will likely fail for the same reason
[08:18:03] I'll revert my config change
[08:19:37] (PS1) Phuedx: Revert "WikimediaEvents: Disable client-side error logging for certain wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1188281
[08:28:05] (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1188281 (owner: Phuedx)
[08:28:52] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:29:00] (Merged) jenkins-bot: Revert "WikimediaEvents: Disable client-side error logging for certain wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1188281 (owner: Phuedx)
[08:33:52] operations/mediawiki-config is back to its initial state. jnuche has pinged folks
[08:42:34] (CR) Muehlenhoff: [C:+2] Enable the regular imports of the OSM updates and water lines on maps2011 [puppet] - https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:43:25] (Abandoned) Muehlenhoff: Enable the regular imports of the OSM updates and water lines on maps2011 [puppet] - https://gerrit.wikimedia.org/r/1186987 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:43:39] (CR) Muehlenhoff: [C:+2] Make maps2012-2014 replica nodes [puppet] - https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:48:17] (CR) Slyngshede: [C:+2] P:cache:haproxy add is_datacenter Lua action [puppet] - https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[08:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:51:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:54:22] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad
[08:55:06] (CR) Filippo Giunchedi: "Ok I did some testing and this is what I found:" [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[08:55:53] (PS1) Effie Mouzeli: P:hcaptcha: fix LVS monitoring endpoint [puppet] - https://gerrit.wikimedia.org/r/1188284
[08:56:41] (PS2) Effie Mouzeli: P:hcaptcha: fix LVS monitoring endpoint [puppet] - https://gerrit.wikimedia.org/r/1188284
[08:57:52] jouncebot: nowandnext
[08:57:52] No deployments scheduled for the next 1 hour(s) and 2 minute(s)
[08:57:52] In 1 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1000)
[08:57:57] cool
[08:58:22] (CR) Elukey: [C:+1] "Worth to test in my opinion!" [puppet] - https://gerrit.wikimedia.org/r/1188284 (owner: Effie Mouzeli)
[09:01:05] (CR) Majavah: "I wonder if the way to go would be to edit `firewall::service` to apply both ferm and nftables resources unconditionally (removing the `in" [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi)
[09:01:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:16] (PS6) Slyngshede: P:cache:haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161)
[09:07:43] (CR) Filippo Giunchedi: [C:+2] profile: ship Cloud VPS root authorized-keys [puppet] - https://gerrit.wikimedia.org/r/1187757 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi)
[09:08:36] (CR) Slyngshede: P:cache:haproxy add datacenter information to provenance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:09:19] (PS1) Ladsgroup: Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188285 (https://phabricator.wikimedia.org/T366938)
[09:09:28] (CR) Ladsgroup: [C:+2] Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188285 (https://phabricator.wikimedia.org/T366938) (owner: Ladsgroup)
[09:13:34] !log stevemunene@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster dse-codfw: Cleanup the dse-k8s-codfw cluster
[09:15:19] (PS1) Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362)
[09:16:16] (PS1) Ladsgroup: lists: Bump number of uwsgi processes to 12 (from 4) [puppet] - https://gerrit.wikimedia.org/r/1188288 (https://phabricator.wikimedia.org/T353891)
[09:17:36] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1188288 (https://phabricator.wikimedia.org/T353891) (owner: Ladsgroup)
[09:19:41] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[09:19:57] (PS2) Lucas Werkmeister (WMDE): Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: Joely Rooke WMDE)
[09:20:04] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[09:20:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:20:37] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: sync
[09:20:57] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync
[09:21:32] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[09:22:19] stevemunene@cumin1003 wipe-cluster (PID 2654089) is awaiting input
[09:23:32] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:23:38] (CR) Arnaudb: [C:+1] "looks good to me, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1188288 (https://phabricator.wikimedia.org/T353891) (owner: Ladsgroup)
[09:23:52] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:24:08] (PS2) Ladsgroup: lists: Bump number of uwsgi processes to 12 (from 4) [puppet] - https://gerrit.wikimedia.org/r/1188288 (https://phabricator.wikimedia.org/T353891)
[09:24:15] (CR) Ladsgroup: [V:+2 C:+2] lists: Bump number of uwsgi processes to 12 (from 4) [puppet] - https://gerrit.wikimedia.org/r/1188288 (https://phabricator.wikimedia.org/T353891) (owner: Ladsgroup)
[09:24:38] (Merged) jenkins-bot: Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob [core] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188285 (https://phabricator.wikimedia.org/T366938) (owner: Ladsgroup)
[09:25:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:25:56] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[09:27:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-ctrl2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:28:04] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[09:28:32] RESOLVED: [4x] KubernetesCalicoDown: dse-k8s-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:30:07] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/termbox: sync
[09:30:09] !log stopping puppet on A:lvs-low-traffic-eqiad and A:lvs-low-traffic-codfw
[09:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:12] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster dse-codfw: Cleanup the dse-k8s-codfw cluster
[09:30:18] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/termbox: sync
[09:31:32] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[09:31:46] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync
[09:32:12] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[09:32:40] RESOLVED: KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-ctrl2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:33:06] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:33:13] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:33:48] !log ladsgroup@deploy1003 Started scap sync-world: Backport: [[gerrit:1188285|Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob]] (T366938)
[09:33:52] T366938: Reduce relying on database locks - https://phabricator.wikimedia.org/T366938
[09:33:57] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[09:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:34:05] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[09:34:25] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: sync
[09:34:32] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync
[09:37:31] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync
[09:37:57] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[09:38:09] (CR) Effie Mouzeli: [C:+2] P:hcaptcha: fix LVS monitoring endpoint [puppet] - https://gerrit.wikimedia.org/r/1188284 (owner: Effie Mouzeli)
[09:39:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:55] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:40:57] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [09:41:40] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [09:44:16] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:47:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [09:50:51] (03PS1) 10Jcrespo: mailman: Update monitoring to 13 mailman processes [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) [09:52:14] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: sync [09:53:20] (03CR) 10CI reject: [V:04-1] mailman: Update monitoring to 13 mailman processes [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) (owner: 10Jcrespo) [09:53:22] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync [09:54:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:54:47] !log elukey@deploy1003 helmfile [codfw] START 
helmfile.d/services/mobileapps: sync [09:55:28] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [09:56:16] jouncebot: nowandnext [09:56:16] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [09:56:16] In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1000) [09:56:43] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/termbox: sync [09:57:25] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/termbox: sync [09:57:42] 06SRE, 10DNS, 06Traffic, 07Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11180272 (10Tamzin) There are any number of historical links, on-wiki, on the mailing lists, and indeed on Phabricator, referring to tokipona.wikipedia.or... [09:58:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1000) [10:02:06] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [10:02:15] (03CR) 10Ladsgroup: [C:03+1] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [10:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:04:35] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [10:09:58] !log jiji@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [10:10:29] (03PS2) 10Jcrespo: mailman: Update monitoring to 13 mailman processes [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) [10:13:31] (03CR) 10Ladsgroup: [C:03+1] "Thanks for catching it!" [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) (owner: 10Jcrespo) [10:13:48] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) (owner: 10Jcrespo) [10:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:14:54] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1160* gradually with 4 steps - Work done [10:15:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport: [[gerrit:1188285|Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob]] (T366938) (duration: 42m 06s) [10:15:54] (03PS1) 10Ladsgroup: db1160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1188299 [10:15:58] T366938: Reduce relying on database locks - https://phabricator.wikimedia.org/T366938 [10:16:55] (03CR) 10Jcrespo: [C:03+2] mailman: Update monitoring to 13 mailman processes [puppet] - 10https://gerrit.wikimedia.org/r/1188294 (https://phabricator.wikimedia.org/T353891) (owner: 10Jcrespo) [10:17:27] (03PS2) 10Ladsgroup: db1160: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1188299 [10:17:33] (03CR) 10Ladsgroup: [V:03+2 C:03+2] db1160: Enable notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/1188299 (owner: 10Ladsgroup) [10:17:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [10:18:55] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:04] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [10:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:57] (03PS1) 10JMeybohm: haproxy ipblocks-all: Filter disabled ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) [10:34:27] (03PS1) 10JMeybohm: Add 'enabled' and 'last_modified' fields to ipblock schema [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188301 [10:34:41] (03CR) 10JMeybohm: [V:03+2 C:03+2] Add 'enabled' and 'last_modified' fields to ipblock schema [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188301 (owner: 10JMeybohm) [10:36:48] FIRING: PuppetFailure: Puppet has failed on maps2012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:38:00] (03PS1) 10Effie Mouzeli: Revert "P:hcaptcha: fix LVS monitoring endpoint" [puppet] - 
10https://gerrit.wikimedia.org/r/1188302 [10:38:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [10:39:17] !log jayme@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [10:39:18] !log jayme@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [10:40:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [10:40:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [10:40:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:43:09] (03CR) 10Elukey: [C:03+1] Revert "P:hcaptcha: fix LVS monitoring endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/1188302 (owner: 10Effie Mouzeli) [10:44:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [10:44:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T402925)', diff saved to https://phabricator.wikimedia.org/P83305 and previous config saved to /var/cache/conftool/dbconfig/20250915-104420-ladsgroup.json [10:44:26] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - 
https://phabricator.wikimedia.org/T402925 [10:45:06] (03CR) 10Slyngshede: [C:03+2] P:cache:haproxy add datacenter information to provenance [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [10:45:16] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:39] (03CR) 10JMeybohm: "I did deploy !102 and ran `requestctl upgrade-schema ipblock`" [puppet] - 10https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [10:48:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [10:48:38] (03CR) 10Effie Mouzeli: [C:03+2] Revert "P:hcaptcha: fix LVS monitoring endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/1188302 (owner: 10Effie Mouzeli) [10:55:08] !log jiji@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [10:58:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:03:53] jiji@cumin1003 restart-pybal (PID 2666742) is awaiting input [11:06:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:53] !log 
jiji@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs [11:10:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T402925)', diff saved to https://phabricator.wikimedia.org/P83308 and previous config saved to /var/cache/conftool/dbconfig/20250915-111005-ladsgroup.json [11:10:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:10:56] (03PS1) 10Stevemunene: Add kubeconfig files for echoserver on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1188307 (https://phabricator.wikimedia.org/T404433) [11:11:48] RESOLVED: PuppetFailure: Puppet has failed on maps2012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:12:10] (03PS1) 10Muehlenhoff: Apply replica role to maps1012-1014 [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) [11:15:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:15:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:19:34] (03PS1) 10Fabfur: hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) [11:21:14] (03PS2) 10Fabfur: hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) [11:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - 
https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:25:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P83310 and previous config saved to /var/cache/conftool/dbconfig/20250915-112512-ladsgroup.json [11:26:07] !log btullis@cumin1003 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [11:26:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:27] (03PS3) 10Fabfur: hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) [11:30:11] (03CR) 10Elukey: [C:03+1] hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) (owner: 10Fabfur) [11:31:05] (03CR) 10Majavah: [C:04-1] passwords: root authorized-keys has moved to puppet.git (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [11:32:04] (03CR) 10Effie Mouzeli: [C:03+1] hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) (owner: 10Fabfur) [11:32:11] (03CR) 10Fabfur: [C:03+2] hiera: remove unneeded option for hcaptcha service [puppet] - 10https://gerrit.wikimedia.org/r/1188309 (https://phabricator.wikimedia.org/T404388) (owner: 10Fabfur) [11:32:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: 
Roll restart of jvm daemons. [11:32:56] (03PS2) 10Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) [11:33:05] (03CR) 10Filippo Giunchedi: passwords: root authorized-keys has moved to puppet.git (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [11:33:31] (03CR) 10Majavah: [C:03+1] passwords: root authorized-keys has moved to puppet.git [labs/private] - 10https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [11:35:28] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1160* gradually with 4 steps - Work done [11:35:43] !log restarting pybal on lvs1020 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188309 (T404388) [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:09] (03PS1) 10Btullis: Revert to using a hostname for the druid_poublic coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1188312 (https://phabricator.wikimedia.org/T403955) [11:36:20] (03PS5) 10Stevemunene: druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) [11:36:20] (03PS4) 10Stevemunene: druid: remove druid100[7-8] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [11:36:20] (03PS1) 10Stevemunene: Add druid101[2,3] to the druid_public_hosts network/firewall range [puppet] - 10https://gerrit.wikimedia.org/r/1188313 (https://phabricator.wikimedia.org/T397441) [11:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:03] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6929/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188312 (https://phabricator.wikimedia.org/T403955) (owner: 10Btullis) [11:38:01] (03CR) 10Btullis: [C:03+1] Add druid101[2,3] to the druid_public_hosts network/firewall range [puppet] - 10https://gerrit.wikimedia.org/r/1188313 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [11:38:50] (03CR) 10Stevemunene: [C:03+2] Add druid101[2,3] to the druid_public_hosts network/firewall range [puppet] - 10https://gerrit.wikimedia.org/r/1188313 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [11:38:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11180482 (10Jhancock.wm) @elukey I forgot to comment when i finished up last week. I got 2049 fixed and go 50-52 ready to go. However, we've had an issue in the... 
[11:40:07] (03CR) 10Stevemunene: [C:03+1] Revert to using a hostname for the druid_poublic coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1188312 (https://phabricator.wikimedia.org/T403955) (owner: 10Btullis) [11:40:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P83312 and previous config saved to /var/cache/conftool/dbconfig/20250915-114020-ladsgroup.json [11:43:20] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11180488 (10MoritzMuehlenhoff) [11:48:40] (03CR) 10Btullis: [V:03+1 C:03+2] Revert to using a hostname for the druid_poublic coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1188312 (https://phabricator.wikimedia.org/T403955) (owner: 10Btullis) [11:53:40] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1188317 [11:55:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T402925)', diff saved to https://phabricator.wikimedia.org/P83313 and previous config saved to /var/cache/conftool/dbconfig/20250915-115527-ladsgroup.json [11:55:32] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:56:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:57] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest 
package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1188317 (owner: 10Muehlenhoff) [12:03:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.082s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:03:20] (03PS2) 10Arnaudb: mailman: add a local disk cache [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) [12:03:20] (03CR) 10Arnaudb: "pcc went ok:" [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [12:05:37] (03PS2) 10Muehlenhoff: Apply replica role to maps1012-1014 [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) [12:09:11] (03CR) 10Btullis: [C:03+1] "As noted in https://phabricator.wikimedia.org/T404068#11180313 this is going to add the echoserver tokens for both dse-k8s clusters, even " [puppet] - 10https://gerrit.wikimedia.org/r/1188307 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:11:20] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449#11180572 (10Jclark-ctr) a:03Jclark-ctr [12:12:02] (03CR) 10Gergő Tisza: [C:04-2] "(being done in I0e7c01ddc67d4924f)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [12:12:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404449#11180573 (10Jclark-ctr) 05Open→03Resolved [12:13:16] RESOLVED: MediaWikiLatencyExceeded: 
p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.222s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:13:53] (03CR) 10Stevemunene: [C:03+2] Add kubeconfig files for echoserver on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1188307 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:17:48] (03CR) 10D3r1ck01: [C:03+1] Allow ClosedWikiProvider on the local domain on SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187980 (https://phabricator.wikimedia.org/T393473) (owner: 10Gergő Tisza) [12:18:31] (03CR) 10Stevemunene: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:21:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:21:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:45] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1188317 (owner: 10Muehlenhoff) [12:24:06] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1188327 (owner: 10L10n-bot) [12:26:00] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1068.mgmt:22 - https://phabricator.wikimedia.org/T404535#11180617 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Cable Verified link restored [12:26:25] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:45] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188330 [12:30:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187980 (https://phabricator.wikimedia.org/T393473) (owner: 10Gergő Tisza) [12:31:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:26] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1188338 (https://phabricator.wikimedia.org/T404586) [12:41:25] FIRING: [4x] SystemdUnitFailed: postgresql@15-main.service on maps2013:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:14] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:49:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T404586 [12:49:37] T404586: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T404586 [12:49:38] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:50:58] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:59:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T404586', diff saved to https://phabricator.wikimedia.org/P83320 and previous config saved to /var/cache/conftool/dbconfig/20250915-125903-fceratto.json [12:59:08] T404586: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T404586 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1300). [13:00:05] MatmaRex, tgr, anzx, and joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:18] hi! 
[13:00:33] hi [13:02:20] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188340 [13:03:35] o/ [13:05:08] I guess I can deploy :) [13:05:41] let’s start with killing PHP session handling [13:06:11] oh, wait [13:06:59] no, I bet deploying is still blocked on https://phabricator.wikimedia.org/T404392 [13:07:12] as jnuche noticed earlier today [13:07:39] or at least, `scap backport` deployments are blocked on that [13:09:03] yeah, looks like we don't have a patch for that yet :( We need a new version of the patch [13:09:15] *we don't have a fix [13:10:30] I’m checking if I can rebase it myself [13:10:40] though I guess I’d still need someone’s approval/review in that case [13:12:08] oh [13:12:12] the patches *were* published [13:12:13] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1187857 [13:12:14] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/1187858 [13:12:29] jnuche: I *think* in that case `scap remove-patch` should be enough, right? [13:12:58] (gerritbot didn’t comment on the task because it doesn’t have access) [13:14:00] Lucas_WMDE: I'm not sure, that could expose the vulnerability again [13:14:10] hm [13:14:17] yeah, i was just going to say, it looks like you can drop them for the next branch [13:14:18] I guess we need to drop them from wmf.19, but not wmf.18? [13:14:26] but not the current branch, since they were not backported [13:14:27] or just backport the now-public patches to wmf.18 [13:14:31] yes [13:15:01] does `scap remove-patch` do a deployment immediately? [13:15:12] or can I `scap remove-patch`, then deploy those backports first, and then continue with the window? [13:15:14] CC James_F btw [13:15:20] so there are two failing patches for that task right now, one affects 18 and it's blocking backports, the other one affects 19 and will block the upcoming train if it's not fixed [13:15:33] Argh. 
[13:16:02] Security patches and multiple human authors do not mix. Sigh. [13:16:18] * Lucas_WMDE tries to figure out what’s affecting wmf.18 [13:16:22] They just need cherry-picking and the security patches being manually removed before a manual sync-world. [13:16:25] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps2014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:30] Which'll make jnuche cry. [13:16:43] (03PS1) 10Jforrester: SECURITY: Do not let getErrorMessages() etc. return HTML ever, at least for now [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188342 (https://phabricator.wikimedia.org/T404392) [13:17:51] And of course they don't cherry-pick apart. One moment. [13:18:02] James_F: as long as we get the right patches in the end, it's all good :) [13:18:17] James_F: you can specify the base when cherry-picking [13:18:30] MatmaRex: Yes, but branch or hash? [13:18:32] (03PS1) 10Muehlenhoff: Enable tile invalidation for the new maps nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188345 (https://phabricator.wikimedia.org/T381565) [13:18:48] Oh, I suppose I can specify the unmerged hash of the first cherry-pick rather than manually pushing? 
[13:18:50] in "Provide base commit sha1 for cherry-pick" copy-paste the sha1 of the base commit (b0755d6341df238980378429fa1dfb4a5ce00920) [13:18:55] yes [13:19:14] (03PS2) 10Jforrester: SECURITY: Do not let error type labels or arguments return HTML either [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188343 (https://phabricator.wikimedia.org/T404392) [13:19:19] Lucas_WMDE: btw, `scap remove-patch` does not do a deployment [13:19:29] There we go. [13:19:51] Lucas_WMDE: Are you OK to scap remove-patch + deploy? Or should I? [13:21:04] jnuche: thanks [13:21:08] James_F: either works for me [13:21:15] * Lucas_WMDE looks at the wmf.18 patches [13:21:22] Lucas_WMDE: Happy for you to go ahead and I'll stand back. [13:21:26] ok [13:21:56] I’ll first try a spiderpig with just those patches, just to see it fail really [13:22:01] then manually scap remove-patch [13:22:03] They'll fail right now, yes. [13:22:03] then another spiderpig [13:22:12] Because the git hashes differ. [13:22:14] and if that still fails, try a sync-world [13:22:15] yeah [13:22:34] But if you `scap remove-patch` and then try, spiderpig should Just Work™ [13:22:37] (Famous last words.) 
[13:22:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188342 (https://phabricator.wikimedia.org/T404392) (owner: 10Jforrester) [13:22:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188343 (https://phabricator.wikimedia.org/T404392) (owner: 10Jforrester) [13:22:53] let’s find out [13:22:54] * Lucas_WMDE scaps [13:23:02] Thanks MatmaRex, BTW. [13:23:12] oh, right, also CI has to run first [13:23:18] Also that. [13:23:28] (03PS1) 10Effie Mouzeli: Revert "P:hcaptcha: only listen to local addresses" [puppet] - 10https://gerrit.wikimedia.org/r/1188347 [13:23:28] But our CI for WL is pretty fast. [13:23:41] Because we aren't (yet) in the gate, so… [13:23:45] (03CR) 10CI reject: [V:04-1] Revert "P:hcaptcha: only listen to local addresses" [puppet] - 10https://gerrit.wikimedia.org/r/1188347 (owner: 10Effie Mouzeli) [13:24:00] 06SRE: requestctl support to enable/disable ipblocks - https://phabricator.wikimedia.org/T404591 (10JMeybohm) 03NEW [13:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:24:57] (03Abandoned) 10Effie Mouzeli: Revert "P:hcaptcha: only listen to local addresses" [puppet] - 10https://gerrit.wikimedia.org/r/1188347 (owner: 10Effie Mouzeli) [13:25:58] (03Merged) 10jenkins-bot: SECURITY: Do not let getErrorMessages() etc. 
return HTML ever, at least for now [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188342 (https://phabricator.wikimedia.org/T404392) (owner: 10Jforrester) [13:26:57] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2214 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1188338 (https://phabricator.wikimedia.org/T404586) (owner: 10Gerrit maintenance bot) [13:28:19] (03PS1) 10Effie Mouzeli: P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) [13:28:23] !log Starting s6 codfw failover from db2229 to db2214 - T404586 [13:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:27] T404586: Switchover s6 master (db2229 -> db2214) - https://phabricator.wikimedia.org/T404586 [13:28:47] (03CR) 10CI reject: [V:04-1] P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:28:49] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:29:11] (03PS1) 10Sohom Datta: Prevent Curation toolbar from preventDefaulting all left click pointer events [extensions/PageTriage] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188349 (https://phabricator.wikimedia.org/T404405) [13:29:19] (03PS2) 10Effie Mouzeli: P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) [13:29:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188345 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:29:47] (03CR) 10CI reject: [V:04-1] P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) 
(owner: 10Effie Mouzeli) [13:29:55] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188340 (owner: 10Muehlenhoff) [13:30:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/PageTriage] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188349 (https://phabricator.wikimedia.org/T404405) (owner: 10Sohom Datta) [13:30:17] (03Merged) 10jenkins-bot: SECURITY: Do not let error type labels or arguments return HTML either [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188343 (https://phabricator.wikimedia.org/T404392) (owner: 10Jforrester) [13:30:37] (03PS3) 10Effie Mouzeli: P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) [13:31:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T404586', diff saved to https://phabricator.wikimedia.org/P83321 and previous config saved to /var/cache/conftool/dbconfig/20250915-133108-fceratto.json [13:31:57] yup, scap prep failed [13:32:27] * James_F nods. [13:32:48] “scap: error: extra arguments found:” [13:32:50] grmblgrmbl [13:32:59] does it not like me removing multiple files at once [13:33:08] Yeah, probably not. 
:-( [13:33:15] * Lucas_WMDE looks if it does anything special beyond a git rm + commit [13:33:20] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:33:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2229 T404586', diff saved to https://phabricator.wikimedia.org/P83322 and previous config saved to /var/cache/conftool/dbconfig/20250915-133322-fceratto.json [13:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:06] nope, looks like it’s really just that [13:34:11] I’ll just do it manually then [13:34:35] (03CR) 10Muehlenhoff: P:hcaptcha: update proxy to listen properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:35:33] committed [13:36:02] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1188342|SECURITY: Do not let getErrorMessages() etc. return HTML ever, at least for now (T404392)]], [[gerrit:1188343|SECURITY: Do not let error type labels or arguments return HTML either (T404392)]] [13:36:21] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [13:36:22] (03PS4) 10Effie Mouzeli: P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) [13:36:24] Lucas_WMDE: Oh, once the security patches are removed you could have just re-continued with SpiderPig. [13:36:38] But sync-world will be faster. 
[13:36:45] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [13:36:48] I’m doing another spiderpig now [13:36:54] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:36:55] do you mean I could have resurrected the failed one somehow? [13:36:57] I don’t see how [13:37:03] (it’s building the image now) [13:37:03] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [13:37:09] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:37:23] Lucas_WMDE: No, just start a new SpiderPig run with the same patches; it'll see they're merged and proceed. [13:37:33] ok, that’s what I did :) [13:37:57] Ah, I thought you were doing the above manually. Never mind! :-) [13:38:08] (03CR) 10Effie Mouzeli: P:hcaptcha: update proxy to listen properly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:39:39] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:40:18] also filed a feature request for scap: T404593 [13:40:19] T404593: scap remove-patch: support removing multiple patches - https://phabricator.wikimedia.org/T404593 [13:42:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1211 (T403966)', diff saved to https://phabricator.wikimedia.org/P83323 and previous config saved to /var/cache/conftool/dbconfig/20250915-134220-ladsgroup.json [13:42:25] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:42:53] image build feels a bit slow [13:43:01] maybe because we’re building two images [13:43:15] It rebuilt languages. [13:43:19] Which is… not ideal. 
[13:43:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:43:34] hmph [13:43:47] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: update proxy to listen properly [puppet] - 10https://gerrit.wikimedia.org/r/1188348 (https://phabricator.wikimedia.org/T404388) (owner: 10Effie Mouzeli) [13:43:49] No i18n changes in the security patches; did something get landed but not deployed that changed i18n elsewise? [13:44:03] it pushed *-publish-81 in four seconds (13:40:14 → 13:40:18) and has been pushing *-publish-83 for ~three minutes now :/ [13:44:53] jouncebot: nowandnext [13:44:53] For the next 0 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1300) [13:44:53] In 0 hour(s) and 45 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1430) [13:45:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Fix weight of s3 replicas in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83324 and previous config saved to /var/cache/conftool/dbconfig/20250915-134537-ladsgroup.json [13:47:07] (03Abandoned) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [13:47:39] wondering how to prioritize the remaining time in the window [13:47:45] (03CR) 10Scott French: [C:03+1] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [13:47:49] I feel like MatmaRex’ change should be isolated and not deployed together with anything else [13:48:08] anzx’s change can probably wait for another window as the workshop isn’t *that* close (sorry) [13:48:20] 
joelyrookewmde’s change looks like a cleanup of the config [13:48:27] there is still 40 minutes until next window [13:48:28] tgr: how urgent are your two config changes? [13:48:37] ah, I didn’t notice the gap [13:48:43] they aren't urgent [13:49:16] scap is now “Waiting 300 seconds for swift after full mediawiki image build (T390251)” btw [13:49:17] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:49:27] the session handling one is the more urgent one I think because we want to avoid it overlapping with other upcoming session-related deployments [13:49:51] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:49:54] then let’s do session handling / MatmaRex first (once the current deployment is done) [13:50:02] then anzx, to prioritize volunteers over staff a bit :) [13:50:08] and then whatever else we still have time for until the xLab window [13:50:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:38] if those changes are risk-free enough we can maybe combine them together (IP cap + WebAuthn, or IP cap + Wikibase feature flag) [13:50:55] !log Created suggested investigation database tables on test2wiki - T404594 [13:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:59] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [13:51:07] makes sense to me [13:51:13] the WebAuthn / ClosedWikiProvider patches are low risk [13:51:15] (03PS1) 10Dreamy Jazz: Document that test2wiki has suggested investigations DB tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) [13:51:19] ack, thanks [13:51:25] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:51:54] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz) [13:52:18] James_F: is the security fix easy to test btw? (once spiderpig gets that far) [13:52:23] would be nice to make sure that I didn’t undeploy it ^^ [13:52:25] Lucas_WMDE: Yes. [13:52:28] ok [13:53:16] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:54:33] (03PS3) 10Filippo Giunchedi: wmcs: expand on ferm::service and profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) [13:55:49] finally finished building the image [13:55:54] At last. [13:56:17] (03CR) 10Filippo Giunchedi: "Thank you, while that might work it is more effort (already!) than I am willing to do on this. In other words at least at this time there'" [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [13:56:30] (03CR) 10Filippo Giunchedi: wmcs: expand on ferm::service and profile::wmcs::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [13:56:51] (03Abandoned) 10Slyngshede: P:cache::haproxy add ASN lookup function [puppet] - 10https://gerrit.wikimedia.org/r/1179136 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [13:58:26] (03CR) 10Majavah: [C:03+1] wmcs: expand on ferm::service and profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: 10Filippo Giunchedi) [13:59:11] (03CR) 10Majavah: [V:03+1 C:03+2] dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 
10Majavah) [13:59:22] !log lucaswerkmeister-wmde@deploy1003 jforrester, lucaswerkmeister-wmde: Backport for [[gerrit:1188342|SECURITY: Do not let getErrorMessages() etc. return HTML ever, at least for now (T404392)]], [[gerrit:1188343|SECURITY: Do not let error type labels or arguments return HTML either (T404392)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:59:29] Testing now. [13:59:29] James_F: please test :) [13:59:31] thanks [13:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:59:48] (03PS3) 10Bking: opensearch-operator: remove unnecessary ClusterRoles from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187900 (https://phabricator.wikimedia.org/T397246) [13:59:51] Yup, looks to still be fixed. 
[13:59:55] !log lucaswerkmeister-wmde@deploy1003 jforrester, lucaswerkmeister-wmde: Continuing with sync [13:59:55] yay [14:01:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Removing db1259 from dumps/vslow group (T403966)', diff saved to https://phabricator.wikimedia.org/P83325 and previous config saved to /var/cache/conftool/dbconfig/20250915-140115-ladsgroup.json [14:01:20] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:03:27] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2229* gradually with 4 steps - Pool in after flip [14:05:15] Lucas_WMDE: I will be here if you're deploying, throttle patch can also be rescheduled for tomorrow , no issues for rescheduling it [14:05:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Setting db2203 weight', diff saved to https://phabricator.wikimedia.org/P83327 and previous config saved to /var/cache/conftool/dbconfig/20250915-140555-fceratto.json [14:06:56] ok, thanks! [14:07:57] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump general weight of db1259 to 500 (T403966)', diff saved to https://phabricator.wikimedia.org/P83328 and previous config saved to /var/cache/conftool/dbconfig/20250915-140822-ladsgroup.json [14:08:27] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:08:57] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1188358 (https://phabricator.wikimedia.org/T404595) [14:09:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:10:22] (03CR) 10Bking: [C:03+2] opensearch-operator: remove unnecessary ClusterRoles from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187900 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:12:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T404595 [14:12:27] T404595: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T404595 [14:12:31] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188342|SECURITY: Do not let getErrorMessages() etc. return HTML ever, at least for now (T404392)]], [[gerrit:1188343|SECURITY: Do not let error type labels or arguments return HTML either (T404392)]] (duration: 36m 28s) [14:12:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:06] jouncebot: nowandnext [14:13:06] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [14:13:07] In 0 hour(s) and 16 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1430) [14:13:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [14:13:26] (03PS1) 10Bking: opensearch-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188359 
(https://phabricator.wikimedia.org/T397246) [14:13:27] let’s try the session handling [14:13:34] and anything beyond that is probably not happening in this window after all [14:13:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T404595', diff saved to https://phabricator.wikimedia.org/P83329 and previous config saved to /var/cache/conftool/dbconfig/20250915-141343-fceratto.json [14:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:14:04] (03CR) 10Scott French: "Awesome, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [14:14:08] (03Merged) 10jenkins-bot: Set $wgPHPSessionHandling to 'disable' on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [14:14:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2218 from API/vslow/dump T404595', diff saved to https://phabricator.wikimedia.org/P83330 and previous config saved to /var/cache/conftool/dbconfig/20250915-141412-fceratto.json [14:14:28] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1144497|Set $wgPHPSessionHandling to 'disable' on remaining wikis (T362324)]] [14:14:32] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [14:19:09] (03CR) 10Btullis: [C:03+1] opensearch-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188359 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:19:25] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1188358 (https://phabricator.wikimedia.org/T404595) (owner: 10Gerrit maintenance bot) [14:20:18] !log 
lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tgr: Backport for [[gerrit:1144497|Set $wgPHPSessionHandling to 'disable' on remaining wikis (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:20:22] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [14:20:33] anything to test for the session change? cc MatmaRex tgr [14:20:41] !log joal@deploy1003 Started deploy [analytics/refinery@edfea88] (hadoop-test): Unique-devices change for unified routing TEST [analytics/refinery@edfea882] [14:20:50] i can take a quick look at logins [14:21:04] !log Starting s7 codfw failover from db2220 to db2218 - T404595 [14:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:08] T404595: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T404595 [14:21:13] it was already deployed once and only rolled back due to OAuth issues, right? 
[14:21:36] yeah ^^ [14:21:48] yes [14:21:49] * Lucas_WMDE looks how long it’s been on group1 this time [14:21:50] !log joal@deploy1003 Finished deploy [analytics/refinery@edfea88] (hadoop-test): Unique-devices change for unified routing TEST [analytics/refinery@edfea882] (duration: 01m 09s) [14:21:52] logins seems fine [14:22:00] so I wouldn't expect any immediately visible problems, we'd have seen those last time [14:22:05] i'm not expecting any issues, i think we would have seen them on group1 [14:22:08] some weird lucaswerkmeister person complained about broken stuff last time ;) [14:22:08] !log joal@deploy1003 Started deploy [analytics/refinery@edfea88]: Unique-devices change for unified routing [analytics/refinery@edfea882] [14:22:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T404595', diff saved to https://phabricator.wikimedia.org/P83332 and previous config saved to /var/cache/conftool/dbconfig/20250915-142221-fceratto.json [14:22:44] ok, it’s been on group1 for almost five days already, that should be enough https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1187066 [14:22:48] let’s go [14:22:50] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tgr: Continuing with sync [14:23:23] (03CR) 10Bking: [C:03+2] opensearch-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188359 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:23:38] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11181050 (10jhathaway) p:05Triage→03Medium [14:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:24:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2220 T404595', diff saved to https://phabricator.wikimedia.org/P83333 and previous config saved to /var/cache/conftool/dbconfig/20250915-142436-fceratto.json [14:25:17] (03Merged) 10jenkins-bot: opensearch-operator: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188359 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:25:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s2 replicas in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83334 and previous config saved to /var/cache/conftool/dbconfig/20250915-142530-ladsgroup.json [14:25:34] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:26:21] !log joal@deploy1003 Finished deploy [analytics/refinery@edfea88]: Unique-devices change for unified routing [analytics/refinery@edfea882] (duration: 04m 12s) [14:27:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance [14:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:55] !log joal@deploy1003 Started deploy [analytics/refinery@edfea88] (thin): Unique-devices change for unified routing THIN [analytics/refinery@edfea882] [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1430) [14:30:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1254 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83335 and previous config saved to /var/cache/conftool/dbconfig/20250915-143008-ladsgroup.json 
[14:30:18] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1144497|Set $wgPHPSessionHandling to 'disable' on remaining wikis (T362324)]] (duration: 15m 49s) [14:30:23] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [14:30:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2220 gradually with 4 steps - Pooling in after schema change [14:31:12] !log joal@deploy1003 Finished deploy [analytics/refinery@edfea88] (thin): Unique-devices change for unified routing THIN [analytics/refinery@edfea882] (duration: 01m 17s) [14:31:16] SRE-tools, Infrastructure-Foundations: secure-cookbook doesn't allow for --dry-run - https://phabricator.wikimedia.org/T404355#11181082 (LSobanski) p:Triage→Medium [14:31:22] !log UTC afternoon backport+config window done [14:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:56] (CR) Filippo Giunchedi: [C:+2] wmcs: expand on ferm::service and profile::wmcs::instance [puppet] - https://gerrit.wikimedia.org/r/1184792 (https://phabricator.wikimedia.org/T401899) (owner: Filippo Giunchedi) [14:35:03] (PS1) Fabfur: Revert "hiera: remove unneeded option for hcaptcha service" [puppet] - https://gerrit.wikimedia.org/r/1188365 [14:35:23] thanks for deploying Lucas_WMDE [14:36:47] (CR) Filippo Giunchedi: [V:+2 C:+2] passwords: root authorized-keys has moved to puppet.git [labs/private] - https://gerrit.wikimedia.org/r/1188287 (https://phabricator.wikimedia.org/T317362) (owner: Filippo Giunchedi) [14:40:34] (PS1) Scott French: shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) [14:41:17] (PS1) Fabfur: profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) [14:41:51] (PS1) Effie 
Mouzeli: P:hcaptcha: unset more headers [puppet] - https://gerrit.wikimedia.org/r/1188367 (https://phabricator.wikimedia.org/T403416) [14:43:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: Joely Rooke WMDE) [14:43:33] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur) [14:45:03] (PS2) JMeybohm: haproxy ipblocks-all: Filter disabled ipblocks [puppet] - https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) [14:45:11] (CR) JMeybohm: "I decided against adding the check there since it would change behavior in an unexpected way. Some abuse ipblocks (like blocked_nets for e" [puppet] - https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm) [14:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:48:54] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2229* gradually with 4 steps - Pool in after flip [14:51:15] (CR) Hnowlan: [C:+1] shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [14:55:22] (CR) Hnowlan: [C:+1] multi-dc: Dynamic rewrite to -ro destinations [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [14:55:54] (CR) 
Krinkle: "CC: claime as familar with the production rest-gateway.lua plugin." [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: Krinkle) [14:59:18] (PS1) Filippo Giunchedi: profile: clean up root-authorized-key.erb transition [puppet] - https://gerrit.wikimedia.org/r/1188374 (https://phabricator.wikimedia.org/T317362) [14:59:38] (CR) BBlack: [C:+1] profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur) [15:00:08] (CR) Clément Goubert: [C:+1] shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [15:03:54] (CR) Hnowlan: [C:+1] alert: Add Slack route to send Prometheus alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse) [15:04:55] (PS6) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [15:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:16:13] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2220 gradually with 4 steps - Pooling in after schema change [15:17:00] (PS1) Fabfur: haproxy: use utf8ps converter on received headers [puppet] - https://gerrit.wikimedia.org/r/1188379 
(https://phabricator.wikimedia.org/T401383) [15:17:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:17:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:19:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:19:35] (CR) Clément Goubert: [C:+1] "One nit, otherwise lgtm, but would like a Traffic sanity check just in case, adding @vgutierrez@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: Krinkle) [15:20:11] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11181269 (elukey) @Jhancock.wm thanks! 
I tried 2049 today and I ended up with: ` [15:21:45] (PS1) Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) [15:22:18] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11181281 (bd808) [15:22:32] (CR) Effie Mouzeli: [C:+1] shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [15:23:29] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur) [15:23:44] (PS7) Clément Goubert: multi-dc: Dynamic rewrite to -ro destinations [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) [15:23:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:24:16] (CR) Clément Goubert: "Rebased following Id74341233348a39a774e1eb222e89ab58605abb5" [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [15:25:09] (CR) Hnowlan: [C:+1] multi-dc: Dynamic rewrite to -ro destinations [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [15:26:49] (CR) Hnowlan: [C:+1] P:hcaptcha: unset more headers [puppet] - https://gerrit.wikimedia.org/r/1188367 (https://phabricator.wikimedia.org/T403416) (owner: Effie 
Mouzeli) [15:28:50] (CR) Scott French: [C:+1] "Thanks for explaining." [puppet] - https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: JMeybohm) [15:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1530). [15:32:07] (PS16) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [15:33:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:48] (CR) Clément Goubert: "Tentatively scheduled for Tuesday September 16th UTC mid-day infra window [0]" [puppet] - https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: Clément Goubert) [15:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:27] (CR) CI reject: [V:-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins) [15:40:36] SRE, DNS, Traffic, Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11181396 (Pppery) > Both of these codes are also referenced in any old revisions containing [[tp:]] or [[tokipona:]] langlinks, so there's a backwards c... 
[15:40:39] SRE, DNS, Traffic, Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11181397 (Pppery) [15:41:06] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:41:24] elukey@cumin1003 provision (PID 2693459) is awaiting input [15:45:33] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11181440 (MoritzMuehlenhoff) [15:47:16] (CR) Herron: thanos: Add recording rules for xlab SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez) [15:47:32] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:49:30] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11181459 (ABran-WMF) [15:49:31] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11181458 (ABran-WMF) [15:49:32] SRE, collaboration-services, Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11181460 (ABran-WMF) [15:50:19] SRE-SLO, Observability-Metrics, Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11181463 (herron) Open→Resolved Backfill process has been documented in https://wikitech.wikimedia.org/wiki/Thanos#Backfilling_Met... 
[15:51:51] (CR) Reedy: mailman: add a local disk cache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: Arnaudb) [15:58:06] (PS1) Majavah: toolforge: wheel-of-misfortune: Exclude sd-pam [puppet] - https://gerrit.wikimedia.org/r/1188390 (https://phabricator.wikimedia.org/T404601) [16:00:04] jasmine_, swfrench-wmf, and hnowlan: It is that lovely time of the day again! You are hereby commanded to deploy DC Switchover Live test. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1600). [16:00:29] (CR) BCornwall: [C:+1] varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) (owner: Krinkle) [16:00:42] (CR) BCornwall: [V:+1 C:+1] "`" [puppet] - https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) (owner: Krinkle) [16:01:01] oh whoops, sorry about that ^ we won't be needing that deployment window in case anyone would like to use it, I'll free it up now)) [16:01:19] (CR) Arnaudb: mailman: add a local disk cache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: Arnaudb) [16:01:36] SRE, Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11181575 (HShaikh) reading the exchange above. I feel like there is a need to get the IPs whitelisted and i can forward that request but I also see that there is a... 
[16:01:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:01:52] (CR) Slyngshede: [C:+1] profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: Fabfur) [16:02:38] (PS17) CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [16:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:09:43] (CR) CI reject: [V:-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins) [16:10:47] (CR) Majavah: [C:+2] toolforge: wheel-of-misfortune: Exclude sd-pam [puppet] - https://gerrit.wikimedia.org/r/1188390 (https://phabricator.wikimedia.org/T404601) (owner: Majavah) [16:13:51] SRE, DNS, Traffic, Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11181616 (taavi) [16:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:16] SRE, DNS, Traffic, Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - 
https://phabricator.wikimedia.org/T404507#11181623 (taavi) Per the task description. This is not actually blocked on the wiki creation, nor is this required for the wiki itself to actually funct... [16:16:08] (CR) Majavah: [C:+2] openstack: nova: fullstack: Drop --ipv6 flag [puppet] - https://gerrit.wikimedia.org/r/1187821 (owner: Majavah) [16:16:30] (CR) Majavah: [C:+2] openstack: nova: fullstack: Use Trixie image [puppet] - https://gerrit.wikimedia.org/r/1187822 (owner: Majavah) [16:17:07] ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609 (RobH) NEW p:Triage→Medium [16:17:16] (PS7) Majavah: openstack: nova: fullstack: Cleanup parameter types [puppet] - https://gerrit.wikimedia.org/r/1187823 [16:18:46] ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11181650 (RobH) @cmooney: What do you think is the best way to go about migrating these connections on upcoming C/D updates? The new switch will be online in the ra... 
[16:19:04] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6933/console" [puppet] - https://gerrit.wikimedia.org/r/1187823 (owner: Majavah) [16:19:51] (CR) Majavah: [V:+1 C:+2] openstack: nova: fullstack: Cleanup parameter types [puppet] - https://gerrit.wikimedia.org/r/1187823 (owner: Majavah) [16:30:01] (CR) Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: Krinkle) [16:30:19] (PS12) Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [16:45:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:49:03] SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11181895 (elukey) The formula for the "fixed" error budget of the current month could be translated, with some quirks, in something like the following (not tested): ` 1-( sum_over_time( ( slo... 
[16:50:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:50:40] (PS1) Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 [16:51:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 (owner: Huei Tan) [16:59:39] (Abandoned) Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188398 (owner: Huei Tan) [17:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki Infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1700). [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T1700). [17:00:17] o/ [17:01:11] (CR) Scott French: "Thanks for the reviews!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [17:01:12] (CR) Scott French: [C:+2] shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [17:03:07] (Merged) jenkins-bot: shellbox-media: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1188364 (https://phabricator.wikimedia.org/T403284) (owner: Scott French) [17:05:22] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:05:40] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:06:32] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:07:08] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:09:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:13:59] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:18:59] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:23:58] RECOVERY - mailman list info on 
lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.336 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:29:40] (PS1) Joal: Update turnilo conf for unique-devices per domain [puppet] - https://gerrit.wikimedia.org/r/1188404 (https://phabricator.wikimedia.org/T401666) [17:30:57] joal: do you need a stamp on the turnilo patch? [17:32:08] I could do with that cdanis :) [17:32:14] (CR) CDanis: [C:+2] Update turnilo conf for unique-devices per domain [puppet] - https://gerrit.wikimedia.org/r/1188404 (https://phabricator.wikimedia.org/T401666) (owner: Joal) [17:32:29] I pinged the DPE-SRE folks, but you doing it now helps! thanks a lot [17:32:41] merged and puppet-merged [17:33:08] May I abuse and ask you for a restart of the turnilo service please? [17:33:18] sure thing! would you also like to be able to do that yourself in the future? [17:33:43] Why not! 
That'd be easier (even if I don't have the right to merge puppet :) [17:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:21] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕜☕ sudo cumin 'C:profile::druid::turnilo' 'run-puppet-agent' && sudo cumin 'C:profile::druid::turnilo' 'systemctl restart turnilo' [17:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:06] ops-codfw, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626 (phaultfinder) NEW [17:36:16] thanks a lot cdanis <3 [17:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:38] jouncebot: turnilo restarted [17:36:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:37:25] oh also joal -- turns out you already have permissions to restart systemctl services on hosts like an-tool1007! [17:37:42] I'll propose adding the ability to trigger a puppet run as well [17:38:11] Great! [17:39:15] i know I could restart services on some hosts, but I didn't know it was available on an-tool1007! 
[17:40:37] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:41:10] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:41:34] !log migrated shellbox-media to PHP 8.3 - T403284 [17:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:38] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [17:41:46] joal: on any of the machines with this role in their puppet role https://codesearch.wmcloud.org/puppet/?q=analytics-admins&files=hieradata%2Frole&excludeFiles=&repos=operations%2Fpuppet [17:42:36] ack cdanis - that's a lot more restarting than I thought I'd be able to do :) [17:47:54] (PS1) CDanis: admin: analytics-admins: allow sudo run-puppet-agent [puppet] - https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) [17:52:47] (CR) Dzahn: [C:+2] switch people service aliases in eqiad and codfw to new trixie hosts [dns] - https://gerrit.wikimedia.org/r/1187885 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [17:52:54] !log dzahn@dns1004 START - running authdns-update [17:53:36] (CR) Dzahn: [C:+2] peopleweb: make people2004 the new rsync source [puppet] - https://gerrit.wikimedia.org/r/1187884 (https://phabricator.wikimedia.org/T402596) (owner: Dzahn) [17:53:40] PROBLEM - people.wikimedia.org requires authentication on people1005 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:54:10] uhm.. that is me and its not expected because I tested that.. 
looking [17:54:14] !log dzahn@dns1004 END - running authdns-update [17:54:22] (PS2) Jasmine: switchdc/mediawiki: remove references to mw-maint hosts [cookbooks] - https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538) [17:58:20] PROBLEM - people.wikimedia.org requires authentication on people2004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:59:07] FIRING: ProbeDown: Service people2004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people2004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:26] (PS1) BCornwall: test, please ignore [cookbooks] - https://gerrit.wikimedia.org/r/1188410 [18:00:01] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people2004.codfw.wmnet with reason: in setup [18:00:25] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people1005.eqiad.wmnet with reason: in setup [18:01:01] (PS1) Dzahn: Revert "peopleweb: make people2004 the new rsync source" [puppet] - https://gerrit.wikimedia.org/r/1188411 [18:01:12] (CR) CI reject: [V:-1] switchdc/mediawiki: remove references to mw-maint hosts [cookbooks] - https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538) (owner: Jasmine) [18:01:37] (CR) Dzahn: [C:+2] Revert "peopleweb: make people2004 the new rsync source" [puppet] - https://gerrit.wikimedia.org/r/1188411 (owner: Dzahn) [18:01:56] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11182112 (VRiley-WMF) Hey @MatthewVernon I wanted to check back in with this ticket 
and see if any of these are available to commence with the swap. No... [18:02:22] (PS3) Jasmine: switchdc/mediawiki: remove references to mw-maint hosts [cookbooks] - https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538) [18:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:04:07] (Abandoned) BCornwall: test, please ignore [cookbooks] - https://gerrit.wikimedia.org/r/1188410 (owner: BCornwall) [18:07:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:13:15] (PS1) MusikAnimal: tables-catalog: add CommunityRequests tables [puppet] - https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) [18:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:15:36] (CR) CI reject: [V:-1] tables-catalog: add CommunityRequests tables [puppet] - https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: MusikAnimal) [18:16:04] (CR) Novem Linguae: tables-catalog: add CommunityRequests tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: MusikAnimal) [18:16:20] RECOVERY - people.wikimedia.org requires authentication on people2004 is OK: HTTP OK: Status line output 
matched HTTP/1.1 302 - 586 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:16:59] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11182132 (VRiley-WMF) |Device A|Device A Port|Device B|Device B Port|Type|cableID|Length required| |----------|-----------------|----------|----------|... [18:17:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:20:40] RECOVERY - people.wikimedia.org requires authentication on people1005 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:21:21] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for people2004.codfw.wmnet [18:21:22] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for people2004.codfw.wmnet [18:21:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for people1005.eqiad.wmnet [18:21:32] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for people1005.eqiad.wmnet [18:23:59] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:26:48] (03PS1) 10Dzahn: Revert^2 "peopleweb: make people2004 the new rsync source" [puppet] - 10https://gerrit.wikimedia.org/r/1188416 [18:28:59] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:34:10] (03CR) 10Dzahn: [C:03+2] Revert^2 "peopleweb: make people2004 the new rsync source" [puppet] - 10https://gerrit.wikimedia.org/r/1188416 (owner: 10Dzahn) [18:36:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:40:46] (03PS2) 10CDanis: admin: analytics-admins: allow sudo puppet-run [puppet] - 10https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) [18:44:15] (03PS1) 10Jasmine: hosts: Ignore mypy error code arg-type [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) [18:44:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for 
measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:46:03] (03PS2) 10Jasmine: sre.hosts.provision: Ignore mypy error code arg-type [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) [18:49:15] (03PS7) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [18:49:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:49:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11182262 (10phaultfinder) [18:56:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187476 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [18:56:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187980 (https://phabricator.wikimedia.org/T393473) (owner: 10Gergő Tisza) [18:58:49] (03PS1) 10Gergő Tisza: session: Cache JWT JTI in CookieSessionProvider [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188420 (https://phabricator.wikimedia.org/T399200) [18:59:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Monday, September 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188420 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [19:06:01] (03PS8) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [19:06:57] (03PS9) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [19:14:51] (03Abandoned) 10Majavah: Tools: Use exported resources for ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/329382 (https://phabricator.wikimedia.org/T153163) (owner: 10Tim Landscheidt) [19:20:24] (03CR) 10BCornwall: [V:03+2 C:03+1] varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T397267) (owner: 10Krinkle) [19:23:45] (03PS3) 10Jasmine: sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) [19:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:27:07] (03PS10) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [19:33:14] (03CR) 10Scott French: [C:03+1] "Thanks, Jasmine!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: 10Jasmine) [19:33:27] 06SRE, 10Wikimedia-Mailing-lists: Excessive Spam on wikipedia-bn@lists.wikimedia.org Mailing List - https://phabricator.wikimedia.org/T388958#11182353 (10Dzahn) a:03MdsShakil [19:34:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11182357 (10phaultfinder) [19:35:57] (03CR) 10Volans: sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: 10Jasmine) [19:37:28] 06SRE, 10DNS, 06Traffic, 07Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11182387 (10Dzahn) This type of redirect/rewrite would likely have to be handled in the appserver apache config rather than the CDN. That would mean serv... [19:38:56] (03PS11) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [19:41:18] (03PS12) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [19:42:03] 06SRE, 10Wikimedia-Mailing-lists: Excessive Spam on wikipedia-bn@lists.wikimedia.org Mailing List - https://phabricator.wikimedia.org/T388958#11182422 (10MdsShakil) Hey @Dzahn, I don’t have full access to the mailing list. @Aftabuzzaman does. Last time I checked with him, he said he had added a header filter,... 
[19:42:42] (03PS4) 10Jasmine: sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) [19:44:04] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Assert whether Commons/Googlebot gets desktop or mobile HTML (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187875 (https://phabricator.wikimedia.org/T397267) (owner: 10Krinkle) [19:44:06] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Switch Commons/Googlebot pilot from desktop to unified mobile [puppet] - 10https://gerrit.wikimedia.org/r/1187876 (https://phabricator.wikimedia.org/T397267) (owner: 10Krinkle) [19:44:11] (03CR) 10Jasmine: sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: 10Jasmine) [19:46:46] (03CR) 10Scott French: [C:03+1] sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: 10Jasmine) [19:49:03] (03PS1) 10Bearloga: statistics: remove product_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) [19:54:10] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1188441 [19:56:35] 503 for everything, grafana included [19:56:52] and back [19:57:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:58:01] o/ [19:58:10] !incidents [19:58:11] 6746 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:58:11] 6742 (RESOLVED) ATSBackendErrorsHigh 
cache_text sre (miscweb.discovery.wmnet magru) [19:58:11] (03PS1) 10Bking: admin_ng: allow opensearch deploy to use role/rolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) [19:58:20] !ack 6746 [19:58:20] 6746 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:58:37] investigating [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T2000) [20:00:04] Sohom_Datta and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:36] (03PS13) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [20:00:39] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [20:01:26] o/ [20:02:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:02:53] is a deployer needed? [20:03:15] Sohom_Datta: i can deploy for you if you are not able to self-deploy [20:03:29] Maybe check with swfrench-wmf the incident is over? [20:03:32] I am not! so that would be great! [20:03:43] oh - was there an incident? 
[20:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:04:19] (I did not know either) [20:04:25] cjming: https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?viewPanel=panel-13&orgId=1&from=now-3h&to=now&timezone=utc&var-site=$__all&var-cache_type=upload&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 [20:04:26] swfrench-wmf: ok to proceed with backport window? [20:04:42] RhinosF1: thanks for that! folks, I believe you should be good to proceed, but it would be handy if you wait for another 5 minutes or so while we check on a couple of things [20:04:57] of course - np - just lmk when it's ok to start [20:04:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11182522 (10phaultfinder) [20:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:08:05] cjming: I think you're good to proceed. thanks, folks! [20:08:40] cool - thanks! 
[20:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:09:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/PageTriage] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188349 (https://phabricator.wikimedia.org/T404405) (owner: 10Sohom Datta) [20:09:52] (03PS2) 10Bking: admin_ng: allow opensearch deploy to use role/rolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) [20:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:09] (03Merged) 10jenkins-bot: Prevent Curation toolbar from preventDefaulting all left click pointer events [extensions/PageTriage] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188349 (https://phabricator.wikimedia.org/T404405) (owner: 10Sohom Datta) [20:21:28] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1188349|Prevent Curation toolbar from preventDefaulting all left click pointer events (T404405)]] [20:21:32] T404405: Cannot type in some textboxes from the page curation toolbar - https://phabricator.wikimedia.org/T404405 [20:25:51] Should I check if it works on the debug servers ? 
[20:26:47] Sohom_Datta: sure - lmk if/when to sync [20:26:59] Just checked, it is working :) [20:27:09] cool [20:27:36] !log cjming@deploy1003 cjming, soda: Backport for [[gerrit:1188349|Prevent Curation toolbar from preventDefaulting all left click pointer events (T404405)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:27:40] T404405: Cannot type in some textboxes from the page curation toolbar - https://phabricator.wikimedia.org/T404405 [20:27:57] !log cjming@deploy1003 cjming, soda: Continuing with sync [20:29:18] (03CR) 10Krinkle: "Test case:" [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [20:29:33] 06SRE, 10Wikimedia-Mailing-lists: Excessive Spam on wikipedia-bn@lists.wikimedia.org Mailing List - https://phabricator.wikimedia.org/T388958#11182620 (10Dzahn) a:05MdsShakil→03None [20:33:17] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188349|Prevent Curation toolbar from preventDefaulting all left click pointer events (T404405)]] (duration: 11m 48s) [20:33:21] T404405: Cannot type in some textboxes from the page curation toolbar - https://phabricator.wikimedia.org/T404405 [20:33:31] Sohom_Datta: should be live :) [20:33:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s2 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83340 and previous config saved to /var/cache/conftool/dbconfig/20250915-203337-ladsgroup.json [20:33:42] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [20:33:43] tgr: will you self-deploy? [20:33:55] cjming: Thank you! [20:34:01] yw! 
[20:34:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s2 in codfw in api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83341 and previous config saved to /var/cache/conftool/dbconfig/20250915-203425-ladsgroup.json [20:35:20] cjming: can do [20:35:27] all yours! [20:35:55] thx [20:36:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187476 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [20:36:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187980 (https://phabricator.wikimedia.org/T393473) (owner: 10Gergő Tisza) [20:36:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188420 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [20:37:39] (03Merged) 10jenkins-bot: Allow creating new WebAuthn passkeys on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187476 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [20:37:44] (03Merged) 10jenkins-bot: Allow ClosedWikiProvider on the local domain on SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187980 (https://phabricator.wikimedia.org/T393473) (owner: 10Gergő Tisza) [20:39:13] 06SRE, 10DNS, 06serviceops, 06Traffic, 07Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11182687 (10A_smart_kitten) [20:42:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Removing one replica from api group in codfw s2 (T403966)', diff saved to https://phabricator.wikimedia.org/P83342 and previous config saved to /var/cache/conftool/dbconfig/20250915-204251-ladsgroup.json [20:42:56] T403966: MariaDB Pre-DC switchover tasks - 
https://phabricator.wikimedia.org/T403966 [20:44:04] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:46:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Removing one replica from api group in eqiad s2 (T403966)', diff saved to https://phabricator.wikimedia.org/P83343 and previous config saved to /var/cache/conftool/dbconfig/20250915-204613-ladsgroup.json [20:47:46] (03CR) 10Dzahn: [C:03+2] "tested queries - first one returns 7 rows - second is currently empty" [puppet] - 10https://gerrit.wikimedia.org/r/1187657 (https://phabricator.wikimedia.org/T404411) (owner: 10Aklapper) [20:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:49:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Revert: Removing one replica from api group in eqiad s2 (T403966)', diff saved to https://phabricator.wikimedia.org/P83344 and previous config saved to /var/cache/conftool/dbconfig/20250915-204922-ladsgroup.json [20:49:28] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [20:49:42] (03Merged) 10jenkins-bot: session: Cache JWT JTI in CookieSessionProvider [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188420 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [20:49:56] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1187476|Allow creating new WebAuthn passkeys on private wikis (T378402 T354701)]], [[gerrit:1187980|Allow ClosedWikiProvider on the local domain on SUL wikis (T393473 T401640)]], [[gerrit:1188420|session: Cache JWT JTI in CookieSessionProvider (T399200)]] [20:50:08] 
T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [20:50:08] T354701: Enable migration of WebAuthn credentials to central domain - https://phabricator.wikimedia.org/T354701 [20:50:09] T393473: Most authentication providers are disabled during autocreation on local domain (SUL3 mode) - https://phabricator.wikimedia.org/T393473 [20:50:10] T401640: ClosedWikiProvider stopped working - https://phabricator.wikimedia.org/T401640 [20:50:11] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [20:56:19] !log tgr@deploy1003 tgr: Backport for [[gerrit:1187476|Allow creating new WebAuthn passkeys on private wikis (T378402 T354701)]], [[gerrit:1187980|Allow ClosedWikiProvider on the local domain on SUL wikis (T393473 T401640)]], [[gerrit:1188420|session: Cache JWT JTI in CookieSessionProvider (T399200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:56:30] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [20:56:30] T354701: Enable migration of WebAuthn credentials to central domain - https://phabricator.wikimedia.org/T354701 [20:56:31] T393473: Most authentication providers are disabled during autocreation on local domain (SUL3 mode) - https://phabricator.wikimedia.org/T393473 [20:56:31] T401640: ClosedWikiProvider stopped working - https://phabricator.wikimedia.org/T401640 [20:56:31] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [20:58:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s5 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83345 and previous config saved to /var/cache/conftool/dbconfig/20250915-205814-ladsgroup.json [20:58:19] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T2100) [21:01:56] !log tgr@deploy1003 tgr: Continuing with sync [21:03:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s5 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83346 and previous config saved to /var/cache/conftool/dbconfig/20250915-210311-ladsgroup.json [21:07:19] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187476|Allow creating new WebAuthn passkeys on private wikis (T378402 T354701)]], [[gerrit:1187980|Allow ClosedWikiProvider on the local domain on SUL wikis (T393473 T401640)]], [[gerrit:1188420|session: Cache JWT JTI in CookieSessionProvider (T399200)]] (duration: 17m 23s) [21:07:29] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [21:07:29] T354701: Enable migration of WebAuthn credentials to central domain - https://phabricator.wikimedia.org/T354701 [21:07:29] T393473: Most authentication providers are disabled during autocreation on local domain (SUL3 mode) - https://phabricator.wikimedia.org/T393473 [21:07:30] T401640: ClosedWikiProvider stopped working - https://phabricator.wikimedia.org/T401640 [21:07:30] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [21:09:06] (03CR) 10BCornwall: [V:03+2 C:03+2] "Behaving as expected in the tests, and looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [21:10:35] !log late UTC deploys done [21:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s6 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83347 and previous config saved to /var/cache/conftool/dbconfig/20250915-211144-ladsgroup.json [21:11:49] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [21:14:15] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [21:14:25] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [21:14:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s6 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83348 and previous config saved to /var/cache/conftool/dbconfig/20250915-211457-ladsgroup.json [21:15:12] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:15:15] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:15:32] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [21:15:48] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [21:18:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db2151 from api group in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83349 and previous config saved to /var/cache/conftool/dbconfig/20250915-211838-ladsgroup.json [21:18:43] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [21:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:25:21] (03PS1) 10RLazarus: deployment_server: Add a script for mass-deploying helmfile services [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) [21:25:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set back the forgotten candidate master weight on s7 codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83351 and previous config saved to /var/cache/conftool/dbconfig/20250915-212526-ladsgroup.json [21:25:32] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [21:29:54] (03CR) 10BCornwall: [C:03+1] envoyproxy: Remove lua_script param [puppet] - 10https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [21:31:49] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188458 (https://phabricator.wikimedia.org/T128546) [21:32:19] (03CR) 10RLazarus: [C:03+2] envoyproxy: Remove lua_script param [puppet] - 10https://gerrit.wikimedia.org/r/1187126 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [21:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:51:04] (03PS1) 10Papaul: Add es1056 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1188460 (https://phabricator.wikimedia.org/T400198) [21:53:47] (03CR) 10Papaul: [C:03+2] Add es1056 to preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1188460 
(https://phabricator.wikimedia.org/T400198) (owner: 10Papaul) [22:00:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:01:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11183060 (10Papaul) @VRiley-WMF es1056 added, you can resume with your install. [22:05:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183129 (10phaultfinder) [22:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:45] (03CR) 10Cwhite: [C:03+1] "Overall 
LGTM! \o/" [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[22:50:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250915T2300)
[23:00:05] Jan Drewniak: A patch you scheduled for Web Team deployment window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:07:48] (CR) Jasmine: [C:+2] sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases [cookbooks] - https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: Jasmine)
[23:11:19] (CR) Jdrewniak: [C:+2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1188458 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[23:12:25] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1188458 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[23:15:57] (Merged) jenkins-bot: sre.hosts.provision: wrap .hosts() call in iter() to meet type expectation in all cases [cookbooks] - https://gerrit.wikimedia.org/r/1188417 (https://phabricator.wikimedia.org/T404635) (owner: Jasmine)
[23:18:27] (PS4) Jasmine: switchdc/mediawiki: remove references to mw-maint hosts [cookbooks] - https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538)
[23:20:07] ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183221 (phaultfinder)
[23:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:24:49] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 11m 48s)
[23:24:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[23:26:38] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 01m 48s)
[23:29:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:29:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:34:04] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:34:04] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1188476
[23:38:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1188476 (owner: TrainBranchBot)
[23:38:59] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[23:39:17] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[23:39:59] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply
[23:40:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[23:40:26] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[23:40:36] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[23:40:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply
[23:40:55] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[23:43:09] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply
[23:43:25] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply
[23:43:35] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[23:43:51] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[23:44:00] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[23:44:08] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[23:44:16] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[23:44:38] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[23:44:47] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[23:44:55] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[23:45:07] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[23:45:14] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[23:45:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[23:45:31] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[23:45:47] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[23:46:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[23:46:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[23:46:59] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[23:52:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1188476 (owner: TrainBranchBot)
[23:54:18] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[23:54:26] (CR) Scott French: "Thanks, Jasmine!" [cookbooks] - https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538) (owner: Jasmine)
[23:54:34] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[23:56:43] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[23:56:58] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply