[00:00:29] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1257* gradually with 4 steps - Work done [00:01:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.928 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:01:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T402763)', diff saved to https://phabricator.wikimedia.org/P83217 and previous config saved to /var/cache/conftool/dbconfig/20250911-000142-ladsgroup.json [00:01:47] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [00:08:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187148 [00:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187148 (owner: 10TrainBranchBot) [00:16:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P83219 and previous config saved to /var/cache/conftool/dbconfig/20250911-001650-ladsgroup.json [00:24:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:03] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1187148 (owner: 10TrainBranchBot) [00:31:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P83221 and previous config saved to /var/cache/conftool/dbconfig/20250911-003157-ladsgroup.json [00:38:15] (03PS1) 10Scott French: shellbox-constraints: end single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187153 (https://phabricator.wikimedia.org/T403284) [00:40:09] PROBLEM - 
mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:40:51] (03CR) 10Scott French: [C:03+2] shellbox-constraints: end single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187153 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [00:42:29] (03Merged) 10jenkins-bot: shellbox-constraints: end single-replica 8.3 pilot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187153 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [00:44:30] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1209* gradually with 4 steps - Work done [00:44:37] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [00:44:43] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [00:45:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.306 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:45:30] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [00:45:34] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [00:45:48] !log finished single-replica PHP 8.3 pilot on shellbox-constraints - T403284 [00:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:51] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [00:46:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:47:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 
(T402763)', diff saved to https://phabricator.wikimedia.org/P83223 and previous config saved to /var/cache/conftool/dbconfig/20250911-004705-ladsgroup.json [00:47:10] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763 [01:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:10:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:35:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:58] FIRING: 
CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:05:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:07:17] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale-full only: 1 (bast3007), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:10:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.038 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:26:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:41:17] RESOLVED: [2x] ProbeDown: Service 
wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:10:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:15:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:15:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:20:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:21:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:56:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.540 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182927 (https://phabricator.wikimedia.org/T375979) (owner: 10Srishakatux) [05:10:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:40] FIRING: SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T0600). [06:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:22:48] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 7679 [06:23:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7679 [06:28:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:06] 
06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11170835 (10JAllemandou) [07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:40] ah, late. [07:02:06] I'll start the deployment. [07:02:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182927 (https://phabricator.wikimedia.org/T375979) (owner: 10Srishakatux) [07:03:17] (03CR) 10Brouberol: "You need to bump the chart version as well, otherwise this won't reach chartmuseum and thus will not be published as it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [07:03:17] (03Merged) 10jenkins-bot: Add namespace alias for scn wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182927 (https://phabricator.wikimedia.org/T375979) (owner: 10Srishakatux) [07:03:55] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1182927|Add namespace alias for scn wiki (T375979)]] [07:04:00] T375979: Namespace aliases for scn.wikipedia - https://phabricator.wikimedia.org/T375979 [07:08:49] !log kartik@deploy1003 srishakatux, kartik: Backport for [[gerrit:1182927|Add namespace alias for scn wiki (T375979)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:10:09] !log kartik@deploy1003 srishakatux, kartik: Continuing with sync [07:15:21] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182927|Add namespace alias for scn wiki (T375979)]] (duration: 11m 26s) [07:15:27] T375979: Namespace aliases for scn.wikipedia - https://phabricator.wikimedia.org/T375979 [07:17:12] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:17:17] I was reading https://www.mediawiki.org/wiki/Gerrit/Privilege_policy and was wondering if accounts that don't (and never did) contribute should be removed from the respective extension. (I did that in the past, but now I would need to file a request on phabricator and it seems a bit unfortunate to file a request that is connected to the real name).
[07:31:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:35:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:44:27] (03CR) 10Muehlenhoff: [C:03+2] Create component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187019 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [07:44:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:46:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:47:48] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [07:52:31] !log upload bacula 9.6.7-7+wmf13u1 to component/bacula9 for trixie-wikimedia T404114 [07:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:35] T404114: Trixie bacula-fd package incompatible with our bacula installation - 
https://phabricator.wikimedia.org/T404114 [07:57:47] (03CR) 10Muehlenhoff: "We're missing the default for cloud.yaml, otherwise LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [08:02:37] (03PS1) 10Majavah: kubeadm: Fix use of deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1187365 [08:03:35] (03CR) 10Slyngshede: [C:03+1] admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) (owner: 10CDobbins) [08:04:58] (03CR) 10Filippo Giunchedi: [C:03+1] kubeadm: Fix use of deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1187365 (owner: 10Majavah) [08:05:32] (03CR) 10Majavah: [C:03+2] kubeadm: Fix use of deprecated fact [puppet] - 10https://gerrit.wikimedia.org/r/1187365 (owner: 10Majavah) [08:06:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:11:15] !log kick off full OSM import for the new maps cluster in eqiad T381565 [08:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:20] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [08:17:43] !log installing systemd bugfix updates on trixie [08:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:25] (03CR) 10Muehlenhoff: [C:03+1] admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) (owner: 10CDobbins) [08:29:41] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1233.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:30:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:32:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:36:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:37:29] (03PS4) 10Slyngshede: P:cache:haproxy add fetch_is_datacenter lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) [08:37:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:38:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1233.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:38:44] (03PS4) 10Arnaudb: Revert^4 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1186512 [08:38:44] (03CR) 10Arnaudb: [C:03+2] "@dzahn@wikimedia.org suggested I add prometheus1005 to the `Hosts:` field. 
Thanks for this, I've been able to see with PCC that mtail scra" [puppet] - 10https://gerrit.wikimedia.org/r/1186512 (owner: 10Arnaudb) [08:39:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:41:28] (03CR) 10Slyngshede: P:cache:haproxy add fetch_is_datacenter lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:41:33] (03PS1) 10Arnaudb: Revert^5 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1187373 [08:42:49] 06SRE, 10CXServer, 10envoy, 10LPL Essential (2025 Jul-Sep), 10LPL Projects (Other): Allow proxy server to accept another valid http header instead of 'HOST' - https://phabricator.wikimedia.org/T404291#11171160 (10Nikerabbit) Not sure under which team envoy falls in. It isn't listed in #sre. We need help... [08:56:45] btullis@cumin1003 provision (PID 2075281) is awaiting input [09:00:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:00:35] (03PS2) 10Slyngshede: P:cache:haproxy add datacenter information to provenance [puppet] - 10https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161) [09:01:20] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1234.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:01:43] (03CR) 10Vgutierrez: P:cache:haproxy add fetch_is_datacenter lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:03:57] (03PS1) 10Gerrit maintenance bot: mariadb: Promote 
db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1187380 (https://phabricator.wikimedia.org/T404299) [09:04:26] btullis@cumin1003 provision (PID 2075281) is awaiting input [09:05:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T404299 [09:05:34] T404299: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T404299 [09:07:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2240 from API/vslow/dump T404299', diff saved to https://phabricator.wikimedia.org/P83224 and previous config saved to /var/cache/conftool/dbconfig/20250911-090708-fceratto.json [09:07:49] !log upgrading haproxykafka to v0.3.16 on cp3066 to test new feature (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/101) (T403176) [09:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:53] T403176: Missing Message and Hostname fields in messages sent to DLQ - https://phabricator.wikimedia.org/T403176 [09:10:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:11:46] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1187380 (https://phabricator.wikimedia.org/T404299) (owner: 10Gerrit maintenance bot) [09:12:52] !log Starting s4 codfw failover from db2179 to db2240 - T404299 [09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:56] T404299: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T404299 [09:13:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2240 to s4 primary T404299', diff saved to 
https://phabricator.wikimedia.org/P83225 and previous config saved to /var/cache/conftool/dbconfig/20250911-091347-fceratto.json [09:16:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2179 T404299', diff saved to https://phabricator.wikimedia.org/P83226 and previous config saved to /var/cache/conftool/dbconfig/20250911-091626-fceratto.json [09:19:55] (03CR) 10Stevemunene: [C:03+2] druid: Bring druid1012.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182698 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [09:20:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:21:32] btullis@cumin1003 provision (PID 2075281) is awaiting input [09:22:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:25:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:25:32] (03PS7) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [09:25:59] (03CR) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [09:29:02] !log upgrading haproxykafka to v0.3.16 on A:cp to test new feature (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/101) (T403176) [09:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:06] T403176: Missing Message and Hostname fields 
in messages sent to DLQ - https://phabricator.wikimedia.org/T403176 [09:30:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:32:16] (03PS1) 10Phuedx: WikimediaEvents: Disable client-side error logging for certain wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) [09:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:34:09] 06SRE, 06Infrastructure-Foundations, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055#11171385 (10ayounsi) 05Open→03Resolved Those alerts got moved to AM for the core routers and switches. They are not alerting for managem... 
[09:37:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:37:20] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11171402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie [09:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T404274 [09:44:02] T404274: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T404274 [09:44:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1244 with weight 0 T404274', diff saved to https://phabricator.wikimedia.org/P83227 and previous config saved to /var/cache/conftool/dbconfig/20250911-094414-ladsgroup.json [09:44:48] (03PS4) 10Stevemunene: druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) [09:44:49] (03PS4) 10Stevemunene: druid: Add druid druid101[2-3] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1182700 (https://phabricator.wikimedia.org/T397441) [09:44:49] (03PS3) 10Stevemunene: druid: remove druid100[7-8] from 
druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/1185840 (https://phabricator.wikimedia.org/T403801) [09:46:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:59] (03CR) 10Stevemunene: [C:03+2] druid: Bring druid1013.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/1182699 (https://phabricator.wikimedia.org/T397441) (owner: 10Stevemunene) [09:50:54] (03PS1) 10Majavah: offboard-user: Add acl*wmcs-team to privileged groups [puppet] - 10https://gerrit.wikimedia.org/r/1187384 [09:51:25] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1187131 (https://phabricator.wikimedia.org/T404274) [09:51:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:51:29] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1187131 (https://phabricator.wikimedia.org/T404274) (owner: 10Gerrit maintenance bot) [09:51:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1036.eqiad.wmne [09:51:30] ube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, 
wikikube-worker1315.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker116 [09:51:30] wmnet, wikikube-worker1106.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1102.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [09:51:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:52:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1108.eqiad.wmnet, wikikube-worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmne [09:52:30] ube-worker1281.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, 
wikikube-worker1132.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1161.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker100 [09:52:30] wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [09:52:39] sigh [09:52:53] ~incidents [09:52:56] !incidents [09:52:57] 6729 (UNACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [09:52:57] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [09:52:57] !incidents [09:52:57] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [09:52:58] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [09:52:58] 6729 (UNACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [09:52:58] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [09:52:58] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [09:52:58] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [09:53:02] !ack 6729 [09:53:03] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [09:53:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:53:43] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [09:53:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [09:53:53] !incidents [09:53:54] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [09:53:54] 6730 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [09:53:54] PROBLEM - Druid historical on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:53:54] 6731 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [09:53:54] Around if needed [09:53:54] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [09:53:54] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [09:53:55] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [09:54:01] !ack 6731 [09:54:01] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [09:54:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:54:57] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Can you take care of the deployment?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [09:55:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:55:46] it's not traffic patterns, I do some extra load on RW esX clusters [09:55:53] but not world ending load [09:56:54] PROBLEM - Druid middlemanager on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:56:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:57:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:58:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:58:58] FIRING: [2x] ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:59:18] !incidents [09:59:19] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [09:59:19] 6730 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [09:59:19] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [09:59:19] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) 
[09:59:20] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [09:59:20] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [09:59:28] !ack 6730 [09:59:29] 6730 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [09:59:51] FIRING: [13x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:00:11] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [10:00:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:00:19] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11171531 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete... 
[10:00:58] RECOVERY - Druid historical on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:00:58] RECOVERY - Druid middlemanager on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:01:36] FIRING: [2x] RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [10:01:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:02:27] !incidents [10:02:28] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [10:02:28] 6730 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [10:02:28] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:02:28] 6732 (UNACKED) [2x] RESTGatewayBackendErrorsHigh sre (rest-gateway eqiad) [10:02:29] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [10:02:29] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [10:02:29] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [10:02:32] !ack 6732 [10:02:33] 6732 (ACKED) [2x] RESTGatewayBackendErrorsHigh sre (rest-gateway eqiad) [10:02:37] rest-gateway is a side effect [10:02:45] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:03:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Switch db2237 and db2179 weights', diff saved to https://phabricator.wikimedia.org/P83228 and previous config saved to /var/cache/conftool/dbconfig/20250911-100328-fceratto.json [10:03:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T404274', diff saved to https://phabricator.wikimedia.org/P83229 and previous config saved to /var/cache/conftool/dbconfig/20250911-100348-ladsgroup.json [10:04:00] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2179 gradually with 4 steps - Pooling in [10:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:04:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:04:51] FIRING: [13x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:04:53] federico3: heads up on the outage if you're doing more db maintenance [10:05:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:05:30] hnowlan: thanks [10:05:50] !incidents [10:05:50] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 
probes/service http_mw-api-ext_ip4 eqiad) [10:05:51] 6730 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [10:05:51] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:05:51] 6732 (ACKED) [2x] RESTGatewayBackendErrorsHigh sre (rest-gateway eqiad) [10:05:51] 6733 (UNACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [10:05:51] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [10:05:51] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [10:05:52] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [10:06:02] !ack 6733 [10:06:03] 6733 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [10:06:57] FIRING: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:07:09] !incidents [10:07:10] 6729 (ACKED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [10:07:10] 6730 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [10:07:10] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:07:10] 6732 (ACKED) [2x] RESTGatewayBackendErrorsHigh sre (rest-gateway eqiad) [10:07:10] 6733 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [10:07:11] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [10:07:11] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [10:07:11] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [10:07:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:07:58] !incidents [10:07:59] 6729 (ACKED) ProbeDown sre (10.2.2.76 
ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [10:07:59] 6730 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [10:07:59] 6731 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [10:07:59] 6732 (ACKED) [2x] RESTGatewayBackendErrorsHigh sre (rest-gateway eqiad) [10:08:00] 6733 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet) [10:08:00] 6734 (UNACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [10:08:00] 6723 (RESOLVED) db1236 (paged)/MariaDB Replica SQL: s7 (paged) [10:08:00] 6724 (RESOLVED) db1236 (paged)/MariaDB Replica IO: s7 (paged) [10:08:00] 6725 (RESOLVED) db1236 (paged)/mysqld processes (paged) [10:08:03] !ack 6734 [10:08:04] 6734 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet) [10:08:09] Gotta love cascading failures [10:08:33] my phone is ringing and interrupting me a LOT [10:08:58] FIRING: [3x] ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:51] FIRING: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:09:53] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1000) [10:10:14] please do not deploy anything for the time being [10:10:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 2.374s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:10:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:11:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [10:11:36] FIRING: [2x] RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [10:11:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [10:11:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:11:57] RESOLVED: [2x] ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:12:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:12:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:12:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1244 to s4 primary and set section read-write T404274', diff saved to https://phabricator.wikimedia.org/P83231 and previous config saved to 
/var/cache/conftool/dbconfig/20250911-101247-ladsgroup.json [10:12:50] RESOLVED: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:12:55] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:13:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:13:44] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [10:13:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:13:58] FIRING: [3x] ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:15:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad 
mw-api-ext releases routed via main (k8s) 1.121s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:15:36] (03PS2) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187132 (https://phabricator.wikimedia.org/T404274) [10:15:37] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187132 (https://phabricator.wikimedia.org/T404274) (owner: 10Gerrit maintenance bot) [10:15:38] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187132 (https://phabricator.wikimedia.org/T404274) (owner: 10Gerrit maintenance bot) [10:15:56] !log ladsgroup@dns1004 START - running authdns-update [10:16:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [10:16:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:17:04] !log ladsgroup@dns1004 END - running authdns-update [10:17:45] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:18:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:19:51] FIRING: [9x] SwaggerProbeHasFailures: Not all 
openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:21:18] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [10:21:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:21:57] PROBLEM - Druid middlemanager on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:21:57] PROBLEM - Druid historical on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:21:59] PROBLEM - Druid middlemanager on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:21:59] PROBLEM - Druid historical on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:22:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync [10:22:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1160 T404274', diff saved to https://phabricator.wikimedia.org/P83233 and previous config saved to /var/cache/conftool/dbconfig/20250911-102232-ladsgroup.json [10:22:37] T404274: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T404274 [10:23:10] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1160.eqiad.wmnet 
[10:23:18] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1160 - Upgrading db1160.eqiad.wmnet [10:23:25] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1160 - Upgrading db1160.eqiad.wmnet [10:26:36] RESOLVED: [2x] RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [10:28:44] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1160.eqiad.wmnet [10:29:07] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1176.eqiad.wmnet [10:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:17] !log roll-restarting proton@eqiad [10:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:22] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync [10:31:17] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync [10:33:59] (03CR) 10Jcrespo: "Yes, sorry, got distracted about ongoing outage. will take care." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [10:34:14] (03CR) 10Jcrespo: [C:03+2] bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [10:34:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:34:57] RECOVERY - Druid middlemanager on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:34:57] RECOVERY - Druid historical on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:35:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1234.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:35:36] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1176.eqiad.wmnet [10:35:59] RECOVERY - Druid historical on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:35:59] RECOVERY - Druid middlemanager on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:37:03] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db2230.codfw.wmnet [10:42:21] !log ladsgroup@cumin1003 END 
(PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2230.codfw.wmnet [10:45:32] jouncebot: nowandnext [10:45:32] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1000) [10:45:33] In 1 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1200) [10:46:57] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1187389 (https://phabricator.wikimedia.org/T404326) [10:47:03] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1187390 (https://phabricator.wikimedia.org/T404326) [10:47:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:47:29] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11171801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie [10:49:27] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2179 gradually with 4 steps - Pooling in [10:51:18] Please check https://www.wikimediastatus.net/ for soon-to-be-started maintenance, affecting mainly English Wikipedia [10:55:15] (03PS5) 10Slyngshede: P:cache:haproxy add fetch_is_datacenter lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) [10:55:57] PROBLEM - Druid historical on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:55:57] PROBLEM - Druid middlemanager on druid1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server 
middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:56:25] (03PS6) 10Slyngshede: P:cache:haproxy add is_datacenter Lua action [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) [10:57:59] PROBLEM - Druid middlemanager on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:58:59] PROBLEM - Druid historical on druid1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:59:02] !log ladsgroup@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404326 [10:59:06] T404326: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T404326 [10:59:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404326 [11:00:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1163 with weight 0 T404326', diff saved to https://phabricator.wikimedia.org/P83236 and previous config saved to /var/cache/conftool/dbconfig/20250911-105959-ladsgroup.json [11:00:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T404326', diff saved to https://phabricator.wikimedia.org/P83237 and previous config saved to /var/cache/conftool/dbconfig/20250911-110036-ladsgroup.json [11:01:15] (03CR) 10Slyngshede: P:cache:haproxy add is_datacenter Lua action (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182763 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:03:32] (03PS3) 10Slyngshede: P:cache:haproxy add datacenter information to provenance [puppet] - 10https://gerrit.wikimedia.org/r/1182782 
(https://phabricator.wikimedia.org/T398161)
[11:04:03] (PS4) Slyngshede: P:cache:haproxy add datacenter information to provenance [puppet] - https://gerrit.wikimedia.org/r/1182782 (https://phabricator.wikimedia.org/T398161)
[11:04:57] RECOVERY - Druid middlemanager on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[11:04:57] RECOVERY - Druid historical on druid1012 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[11:04:59] RECOVERY - Druid middlemanager on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[11:04:59] RECOVERY - Druid historical on druid1013 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[11:06:01] these ones are new hosts --^
[11:06:12] cc: stevemunene (some noise from alerts)
[11:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:06:36] yeah they've been quite noisy throughout this, probably too late but if you could silence them it'd be handy
[11:06:50] thanks luca adding a silence, new hosts being added
[11:07:01] (CR) Ladsgroup: [V:+2 C:+2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1187389 (https://phabricator.wikimedia.org/T404326) (owner: Gerrit maintenance bot)
[11:07:17] RECOVERY - Backup freshness on
backup1014 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:07:33] !log Starting s1 eqiad failover from db1184 to db1163 - T404326
[11:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:38] T404326: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T404326
[11:08:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T404326', diff saved to https://phabricator.wikimedia.org/P83238 and previous config saved to /var/cache/conftool/dbconfig/20250911-110821-ladsgroup.json
[11:08:58] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:09:16] !log stevemunene@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on druid[1012-1013].eqiad.wmnet with reason: New druid_public hosts in setup
[11:10:48] (CR) Ladsgroup: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1187390 (https://phabricator.wikimedia.org/T404326) (owner: Gerrit maintenance bot)
[11:11:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:12:02] !log ladsgroup@dns1004 START - running authdns-update
[11:13:10] !log ladsgroup@dns1004 END - running authdns-update
[11:15:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1184 T404326', diff saved to https://phabricator.wikimedia.org/P83239 and previous config saved to /var/cache/conftool/dbconfig/20250911-111545-ladsgroup.json
[11:15:50] T404326: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T404326
[11:16:55] !log
ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1184.eqiad.wmnet
[11:17:03] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1184 - Upgrading db1184.eqiad.wmnet
[11:17:10] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1184 - Upgrading db1184.eqiad.wmnet
[11:22:58] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1184.eqiad.wmnet
[11:25:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:27:26] (PS1) Vgutierrez: haproxy: Decode header content before logging [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383)
[11:27:45] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[11:27:53] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[11:28:55] (PS2) Vgutierrez: haproxy: Decode header content before logging [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383)
[11:30:30] !log installing shadow security updates
[11:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:51] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[11:34:19] btullis@cumin1003 provision (PID 2091711) is awaiting input
[11:39:37] !log installing apache2 security updates
[11:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:50] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:48:52] btullis@cumin1003 provision (PID 2091711)
is awaiting input
[11:49:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:57:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1200)
[12:00:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1235.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:01:05] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update
[12:02:28] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1215.eqiad.wmnet
[12:05:07] (PS1) Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403255)
[12:05:57] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1187414
[12:06:04] (PS2) Brouberol: dumps: disable rsync access for 2 dead dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: Xcollazo)
[12:06:06] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: Xcollazo)
[12:08:07] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1215.eqiad.wmnet
[12:08:08] (CR) Dbrant: [C:+2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1187414 (owner: PipelineBot)
[12:09:22] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking),
Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275#11172095 (Seddon)
[12:09:51] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1187414 (owner: PipelineBot)
[12:09:58] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update
[12:10:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:36] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:10:44] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Update
[12:11:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[12:11:51] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:12:11] sre-alert-triage, Data-Platform-SRE (2025.09.05 - 2025.09.26), Essential-Work: Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11172127 (Gehel)
[12:12:27] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:13:02] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:13:15] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:13:51] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:14:04] mvernon@cumin2002 reimage (PID 2845089) is awaiting input
[12:15:09] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.833 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:16:50] SRE-SLO, observability, Data-Platform-SRE (2025.09.05 - 2025.09.26), Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11172189 (Gehel)
[12:18:11] (CR) Brouberol: [C:+1] dumps: disable rsync access for 2 dead dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: Xcollazo)
[12:18:23] (CR) Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403255) (owner: Marco Fossati)
[12:19:53] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Update
[12:20:39] SRE, Data-Engineering, Traffic-Icebox,
MobileFrontend (Tracking), Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275#11172263 (Seddon)
[12:21:47] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking), Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275#11172291 (Seddon)
[12:22:19] (PS1) Effie Mouzeli: hcaptcha: switch proxy port to 4260 [puppet] - https://gerrit.wikimedia.org/r/1187420 (https://phabricator.wikimedia.org/T403416)
[12:23:42] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking), Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275#11172323 (Seddon)
[12:24:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:26:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:27:06] btullis@cumin1003 provision (PID 2097714) is awaiting input
[12:29:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Upgrade db2207 for semi-sync bug', diff
saved to https://phabricator.wikimedia.org/P83240 and previous config saved to /var/cache/conftool/dbconfig/20250911-122956-ladsgroup.json
[12:30:21] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db2207.codfw.wmnet
[12:30:42] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db2207 - Upgrading db2207.codfw.wmnet
[12:30:50] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2207 - Upgrading db2207.codfw.wmnet
[12:32:48] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie
[12:32:57] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11172357 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete...
[12:36:43] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2207.codfw.wmnet
[12:36:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT flinkdeployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:37:22] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db2207* gradually with 4 steps - Work done
[12:38:14] (PS1) Ayounsi: Interface validators: prevent more mistakes on interface naming [software/netbox-extras] - https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146)
[12:39:04] (CR) Ayounsi: "Not tested."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/1187427 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[12:40:23] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): [QA Task] Verify Android compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404342 (Seddon) NEW
[12:40:24] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking), Wikipedia-iOS-App-Backlog (iOS Release FY2025-26): [QA Task] Verify iOS compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404275#11172391 (Seddon)
[12:41:32] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11172402 (Jclark-ctr) a:Jclark-ctr→bking
[12:41:55] !log push pfw policy - T404256
[12:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.184s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:43:58] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:46:05] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-worker1236.mgmt.eqiad.wmnet with chassis set
policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:48:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1236.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:48:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.125s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:50:22] (PS2) Brouberol: flink: build flink 1.20 on top of bookworm/jdk 17 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1187416 (https://phabricator.wikimedia.org/T403838)
[12:51:17] (CR) Ayounsi: [C:+2] Use Homer to configure the network [cookbooks] - https://gerrit.wikimedia.org/r/1166407 (owner: Ayounsi)
[12:54:01] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1184* gradually with 4 steps - Work done
[12:55:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.794 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:55:37] (PS3) Vgutierrez: haproxy: Decode header content before logging [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383)
[12:55:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[12:56:01] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11172459 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie
[12:56:35] PROBLEM - mailman archives on lists1004 is
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:56:37] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[12:58:54] (Merged) jenkins-bot: Use Homer to configure the network [cookbooks] - https://gerrit.wikimedia.org/r/1166407 (owner: Ayounsi)
[13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1300).
[13:00:04] No Gerrit patches in the queue for this window AFAICS.
[13:00:13] o/
[13:00:21] yup, looks like nothing to do :)
[13:01:20] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie
[13:01:27] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11172508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete...
[13:01:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 2.430 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:01:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie
[13:02:00] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11172510 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie
[13:03:58] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[13:05:21] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): [QA Task] Verify Android compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404342#11172513 (402998) a:402998 Verified Working https://www.mediawiki.org/wiki/Requests_for_comment/Mobil...
[13:05:52] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1002
[13:05:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1002
[13:08:21] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[13:09:11] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Thanos
[13:15:22] !log installing imagemagick security updates
[13:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:45] mvernon@cumin2002 reimage (PID 2913800) is awaiting input
[13:17:22] (PS1) Ayounsi: run_homer fix mistake [cookbooks] - https://gerrit.wikimedia.org/r/1187436
[13:17:41] (CR) Xcollazo: "@brouberol@wikimedia.org thank you for the review and the `Hosts` fix." [puppet] - https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: Xcollazo)
[13:18:06] (CR) Brouberol: [C:+1] "pleasure!" [puppet] - https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: Xcollazo)
[13:18:33] (CR) Ayounsi: [C:+2] Handle nokia interface name style [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[13:20:34] (Merged) jenkins-bot: Handle nokia interface name style [software/netbox-extras] - https://gerrit.wikimedia.org/r/1186985 (https://phabricator.wikimedia.org/T404146) (owner: Ayounsi)
[13:20:46] (CR) Fabfur: [C:+1] "did some tests and lgtm!"
[puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[13:22:01] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest1002
[13:22:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1002
[13:23:01] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2207* gradually with 4 steps - Work done
[13:26:07] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[13:26:21] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[13:26:26] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[13:26:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[13:27:33] (CR) Ayounsi: [C:+2] "Tested with test-cookbook" [cookbooks] - https://gerrit.wikimedia.org/r/1187436 (owner: Ayounsi)
[13:29:40] (PS1) Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251)
[13:31:32] (PS1) Btullis: Revert^2 "Fix the partman recipe for dse-k8s-worker1014" [puppet] - https://gerrit.wikimedia.org/r/1187440
[13:32:07] !log depool cp7001
[13:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:34] (CR) Vgutierrez: [C:+2] haproxy: Decode header content before logging [puppet] - https://gerrit.wikimedia.org/r/1187399 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[13:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:34:51] (Merged) jenkins-bot: run_homer fix mistake [cookbooks] - https://gerrit.wikimedia.org/r/1187436 (owner: Ayounsi)
[13:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:28] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1184* gradually with 4 steps - Work done
[13:39:39] (CR) Hnowlan: [C:+1] hcaptcha: switch proxy port to 4260 [puppet] - https://gerrit.wikimedia.org/r/1187420 (https://phabricator.wikimedia.org/T403416) (owner: Effie Mouzeli)
[13:42:53] (PS1) Vgutierrez: haproxy: Don't double escape headerss decoded with json [puppet] - https://gerrit.wikimedia.org/r/1187442 (https://phabricator.wikimedia.org/T401383)
[13:44:03] (PS2) Vgutierrez: haproxy: Don't double escape headerss decoded with json [puppet] - https://gerrit.wikimedia.org/r/1187442 (https://phabricator.wikimedia.org/T401383)
[13:44:09] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet
[13:44:38] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187442 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[13:45:15] (CR) Fabfur: [C:+1] haproxy: Don't double escape headerss decoded with json [puppet] - https://gerrit.wikimedia.org/r/1187442 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[13:46:23] (PS1) Scott French: shellbox-constraints: upgrade to PHP 8.3 [deployment-charts] -
https://gerrit.wikimedia.org/r/1187162 (https://phabricator.wikimedia.org/T403284)
[13:47:11] (CR) Vgutierrez: [C:+2] haproxy: Don't double escape headerss decoded with json [puppet] - https://gerrit.wikimedia.org/r/1187442 (https://phabricator.wikimedia.org/T401383) (owner: Vgutierrez)
[13:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:49:21] (CR) Clément Goubert: [C:+1] shellbox-constraints: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1187162 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[13:49:46] (PS1) Effie Mouzeli: hcaptcha: add healthcheck endpoint [puppet] - https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416)
[13:50:12] !log installing kitty security updates
[13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:49] (PS1) Ssingh: conftool-data: add proxoid (hcaptcha urldownloader) [puppet] - https://gerrit.wikimedia.org/r/1187445 (https://phabricator.wikimedia.org/T403416)
[13:52:10] (CR) Btullis: [C:+2] Revert^2 "Fix the partman recipe for dse-k8s-worker1014" [puppet] - https://gerrit.wikimedia.org/r/1187440 (owner: Btullis)
[13:52:14] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet
[13:52:41] jouncebot: nowandnext
[13:52:41] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1300)
[13:52:41] In 0 hour(s) and 37 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1430)
[13:54:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via
main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:54:18] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11172734 (BTullis) >>! In T399779#11169380, @Jclark-ctr wrote: > @bking If you could update the Partman for EFI booting — it was originally set up for Legacy. I had requested the change to EFI...
[13:55:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1217.eqiad.wmnet with reason: Reboot
[13:55:50] (CR) Effie Mouzeli: [C:+1] conftool-data: add proxoid (hcaptcha urldownloader) [puppet] - https://gerrit.wikimedia.org/r/1187445 (https://phabricator.wikimedia.org/T403416) (owner: Ssingh)
[13:55:58] (CR) Scott French: "Thanks for the review!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1187162 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[13:56:04] (PS1) Ssingh: conftool-data: add proxoid to services [puppet] - https://gerrit.wikimedia.org/r/1187448 (https://phabricator.wikimedia.org/T403416)
[13:56:20] (CR) Scott French: [C:+2] shellbox-constraints: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1187162 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[13:56:27] SRE, SRE-swift-storage: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie - https://phabricator.wikimedia.org/T404351 (MatthewVernon) NEW
[13:56:46] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006
[13:56:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host sretest2006
[13:56:59] (CR) Ssingh: [C:+2] conftool-data: add proxoid (hcaptcha urldownloader) [puppet] - https://gerrit.wikimedia.org/r/1187445 (https://phabricator.wikimedia.org/T403416) (owner: Ssingh)
[13:57:08] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006
[13:57:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006
[13:58:04] (Merged) jenkins-bot: shellbox-constraints: upgrade to PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1187162 (https://phabricator.wikimedia.org/T403284) (owner: Scott French)
[13:58:35] !log sukhe@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: cluster=proxoid,service=nginx [reason: setting weight for proxoid]
[13:59:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:00:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance
[14:01:17] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[14:01:25] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[14:01:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance
[14:01:55] (PS1) Joal: Fix raw webrequest data purge job [puppet] - https://gerrit.wikimedia.org/r/1187450 (https://phabricator.wikimedia.org/T386177)
[14:02:56] (CR) Effie Mouzeli: "PCC ok https://puppet-compiler.wmflabs.org/output/1187444/6887/urldownloader1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: Effie Mouzeli)
[14:03:34] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006
[14:03:38] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[14:03:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006
[14:03:57] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[14:04:00] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[14:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed -
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:04:11] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:13] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:13] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:18] huh? [14:04:26] (03CR) 10Effie Mouzeli: [C:03+2] "PCCC ok" [puppet] - 10https://gerrit.wikimedia.org/r/1187420 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:04:33] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:33] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:37] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [14:04:57] PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:04:57] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:06:03] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006 [14:06:33] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:06:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006 [14:06:57] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: 
https://wikitech.wikimedia.org/wiki/HAProxy [14:07:33] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:07:36] (03PS1) 10Muehlenhoff: maps1011: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1187453 (https://phabricator.wikimedia.org/T381565) [14:07:57] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:07:57] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:08:11] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:08:13] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:08:13] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:08:13] 10SRE-tools, 06Infrastructure-Foundations: secure-cookbook doesn't allow for --dry-run - https://phabricator.wikimedia.org/T404355 (10ayounsi) 03NEW [14:08:20] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding proxoid service IPs - sukhe@cumin1003" [14:08:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding proxoid service IPs - sukhe@cumin1003" [14:08:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:42] (03PS1) 10Ssingh: wmnet: add proxoid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1187454 (https://phabricator.wikimedia.org/T403416) [14:08:54] (03PS1) 10Vgutierrez: haproxy: Revert header content decoding [puppet] - 10https://gerrit.wikimedia.org/r/1187455 
(https://phabricator.wikimedia.org/T401383) [14:09:56] (03PS2) 10Vgutierrez: haproxy: Revert header content decoding [puppet] - 10https://gerrit.wikimedia.org/r/1187455 (https://phabricator.wikimedia.org/T401383) [14:11:51] (03CR) 10Ssingh: [C:03+1] "Looks good, will use this for the healthcheck." [puppet] - 10https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:13:11] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356 (10MatthewVernon) 03NEW [14:13:33] (03PS2) 10Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - 10https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) [14:13:43] (03PS1) 10Muehlenhoff: Make maps2012-2014 replica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) [14:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:43] (03PS3) 10Vgutierrez: haproxy: Revert header content decoding [puppet] - 10https://gerrit.wikimedia.org/r/1187455 (https://phabricator.wikimedia.org/T401383) [14:14:49] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox: Updates for Nokia switch support - https://phabricator.wikimedia.org/T404146#11172893 (10ayounsi) Above patch worked successfully: https://netbox.wikimedia.org/extras/changelog/?request_id=7940ab40-742b-47fb-98c6-fba8e4e2989b Howev... 
[14:16:12] (03CR) 10Vgutierrez: [C:03+1] "addresses match netbox content" [dns] - 10https://gerrit.wikimedia.org/r/1187454 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:16:34] (03CR) 10Ssingh: [C:03+2] wmnet: add proxoid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1187454 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:16:46] !log sukhe@dns1004 START - running authdns-update [14:17:21] (03CR) 10Kosta Harlan: "Can we use this in T404204, as a lightweight check from MW before setting the captcha instance to hCaptcha?" [puppet] - 10https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:17:47] (03CR) 10Fabfur: [C:03+1] "at glance they look the same set on netbox" [dns] - 10https://gerrit.wikimedia.org/r/1187454 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:17:48] (03PS3) 10Andrew Bogott: Ceph rbd: remove option to use 'civetweb' front-end [puppet] - 10https://gerrit.wikimedia.org/r/1186649 [14:17:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott) [14:17:55] !log sukhe@dns1004 END - running authdns-update [14:18:36] (03CR) 10Effie Mouzeli: "this is testing the proxy is alive, however, it does not test the proxy can reach hcaptcha.com" [puppet] - 10https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:18:39] !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache proxoid.svc.eqiad.wmnet on all recursors [14:18:42] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proxoid.svc.eqiad.wmnet on all recursors [14:18:50] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359 (10ECohen_WMDE) 03NEW [14:20:20] (03CR) 10Kosta Harlan: "Yeah. For that we need a request like this https://docs.hcaptcha.com/#integration-testing-test-keys. 
But that might be too time consuming." [puppet] - 10https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:20:58] (03PS1) 10Ssingh: serivce.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) [14:20:58] (03CR) 10Effie Mouzeli: [C:03+2] hcaptcha: add healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1187444 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [14:21:11] (03CR) 10Fabfur: [C:03+1] haproxy: Revert header content decoding [puppet] - 10https://gerrit.wikimedia.org/r/1187455 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [14:21:14] 06SRE, 10SRE-swift-storage: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie - https://phabricator.wikimedia.org/T404351#11172983 (10MatthewVernon) [14:21:42] (03CR) 10Effie Mouzeli: [C:03+1] conftool-data: add proxoid to services [puppet] - 10https://gerrit.wikimedia.org/r/1187448 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:22:14] (03CR) 10Vgutierrez: [C:03+2] haproxy: Revert header content decoding [puppet] - 10https://gerrit.wikimedia.org/r/1187455 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [14:22:19] (03CR) 10Ssingh: [C:03+2] conftool-data: add proxoid to services [puppet] - 10https://gerrit.wikimedia.org/r/1187448 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:22:35] vgutierrez: merge your change? [14:22:37] ok to? [14:22:40] Valentin Gutierrez: haproxy: Revert header content decoding (6191b9c07d) [14:22:44] sukhe: go ahead please [14:22:50] done [14:22:51] managers messing with my merges! [14:22:54] what's next? 
[14:22:56] ;P [14:23:00] "managers" [14:24:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:24:50] (03CR) 10Muehlenhoff: [C:03+2] maps1011: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1187453 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:25:09] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [14:25:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:25:17] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11173029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors: - srete... 
[14:25:44] (03CR) 10Stevemunene: [C:03+2] Fix raw webrequest data purge job [puppet] - 10https://gerrit.wikimedia.org/r/1187450 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [14:25:45] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [14:26:03] (03PS1) 10Brouberol: Add a dummy secret file containing the wikiadmin password [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 [14:26:10] (03CR) 10Effie Mouzeli: [C:03+1] serivce.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:26:16] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [14:26:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:26:37] !log migrated shellbox-constraints to PHP 8.3 - T403284 [14:26:38] (03PS1) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [14:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:41] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [14:27:13] !log repool cp7001 [14:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:56] (03PS2) 10Ssingh: service.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) [14:28:04] (03CR) 10Ssingh: "Fixed typo in commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [14:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1430) [14:30:09] (03PS2) 10Brouberol: Add a dummy secret file containing the wikiadmin password [labs/private] - 10https://gerrit.wikimedia.org/r/1187463 [14:32:51] !setting db2185 (db_inventory-codfw) as read only T399540 [14:32:51] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [14:37:05] !log disabled puppet on A:cp - T403655 [14:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] T403655: Configure mw-next-routing for the PHP 8.3 migration - https://phabricator.wikimedia.org/T403655 [14:38:23] (03CR) 10Scott French: [C:03+2] hieradata: add mw-next-routing to ATS tslua plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1184915 (https://phabricator.wikimedia.org/T403655) (owner: 10Scott French) [14:39:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:15] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/device-analytics: sync [14:41:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:31] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/device-analytics: sync [14:41:59] !log elukey@deploy1003 helmfile [eqiad] START 
helmfile.d/services/device-analytics: sync [14:41:59] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:42:13] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: sync [14:43:28] (03PS1) 10Muehlenhoff: Fix replica slot name [puppet] - 10https://gerrit.wikimedia.org/r/1187468 [14:44:27] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [14:45:06] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/edit-analytics: sync [14:45:23] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/edit-analytics: sync [14:45:44] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: sync [14:45:56] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: sync [14:46:48] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/editor-analytics: sync [14:47:00] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/editor-analytics: sync [14:47:31] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11173184 (10WMDE-leszek) I approve this request on WMDE's behalf. According to our records @ECohen_WMDE's dev account might not be in `nda` and `wmde` LDAP groups yet - I trust WMF SRE a... [14:47:52] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: sync [14:48:04] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: sync [14:48:12] (03CR) 10Stoyofuku-wmf: "This makes sense! 
Apparently I have +2 rights in this repo now, but will hold off on approving until 9/22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187052 (https://phabricator.wikimedia.org/T402048) (owner: 10Jdlrobson) [14:48:14] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bookworm [14:48:41] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/geo-analytics: sync [14:48:55] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/geo-analytics: sync [14:49:03] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: sync [14:49:19] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: sync [14:49:36] (03CR) 10Elukey: [C:03+1] Fix replica slot name [puppet] - 10https://gerrit.wikimedia.org/r/1187468 (owner: 10Muehlenhoff) [14:50:01] (03CR) 10Muehlenhoff: [C:03+2] Fix replica slot name [puppet] - 10https://gerrit.wikimedia.org/r/1187468 (owner: 10Muehlenhoff) [14:50:08] (03PS2) 10Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) [14:50:17] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [14:50:24] (03CR) 10DCausse: [C:03+1] flink: build flink 1.20 on top of bookworm/jdk 17 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187416 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [14:50:43] (03CR) 10Brouberol: [C:03+2] flink: build flink 1.20 on top of bookworm/jdk 17 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187416 (https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [14:50:47] (03CR) 10Brouberol: [V:03+2 C:03+2] flink: build flink 1.20 on top of bookworm/jdk 17 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1187416 
(https://phabricator.wikimedia.org/T403838) (owner: 10Brouberol) [14:51:07] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/image-suggestion: sync [14:51:22] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/image-suggestion: sync [14:51:24] 06SRE, 06Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11173200 (10ssingh) @JAbrams and I discussed this on a call today. @BCornwall will help from lead this from Traffic. The next steps are fairly... [14:51:36] !log incrementally running puppet agent on A:cp - T403655 [14:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] T403655: Configure mw-next-routing for the PHP 8.3 migration - https://phabricator.wikimedia.org/T403655 [14:51:48] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/image-suggestion: sync [14:52:04] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: sync [14:52:46] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: sync [14:53:01] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: sync [14:53:06] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: sync [14:53:23] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: sync [14:54:09] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: sync [14:54:22] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: sync [14:54:27] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/page-analytics: sync [14:54:39] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/page-analytics: sync [14:55:36] (03PS1) 10Effie Mouzeli: hcaptcha: define nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1187471 
(https://phabricator.wikimedia.org/T403416) [14:56:44] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: sync [14:56:56] (03PS3) 10Ssingh: service.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) [14:57:00] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: sync [14:57:10] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: sync [14:57:24] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: sync [14:59:59] (03CR) 10Jasmine: [C:03+2] switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874) (owner: 10Jasmine) [15:00:05] dduvall and dancy: Time to do the Train log triage deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1500). 
[15:00:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:53] (03CR) 10Elukey: Make maps2012-2014 replica nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:01:11] (03CR) 10Vgutierrez: service.yaml: add proxoid low-traffic service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:01:50] (03PS4) 10Ssingh: service.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) [15:01:52] (03CR) 10Ssingh: service.yaml: add proxoid low-traffic service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:03:04] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11173253 (10elukey) >>! In T402584#11166132, @elukey wrote: > Ack! Upgraded staging, and pinged the DSE SREs as well on slack to gather their opinion about ownership etc..... 
[15:04:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:31] (03PS1) 10Ssingh: hiera: O:url_downloader: set LVS realserver pools [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) [15:06:30] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [15:07:39] (03PS2) 10Muehlenhoff: Make maps2012-2014 replica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) [15:07:41] (03Merged) 10jenkins-bot: switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874) (owner: 10Jasmine) [15:08:07] (03PS3) 10Muehlenhoff: Make maps2012-2014 replica nodes [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) [15:08:20] (03CR) 10Muehlenhoff: Make maps2012-2014 replica nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:12:52] (03CR) 10Vgutierrez: [C:03+1] service.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:13:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [15:14:09] (03PS1) 10Ssingh: wmnet: add proxoid A/A service records [dns] - 10https://gerrit.wikimedia.org/r/1187475 (https://phabricator.wikimedia.org/T403416) [15:15:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1233.eqiad.wmnet with OS bullseye [15:16:09] (03CR) 10A smart kitten: [C:03+1] "lgtm, this should be a no-op given that 
`wmgUseWikimediaEvents` is set to `false` for the `private` dblist (which these wikis are [all a m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187382 (https://phabricator.wikimedia.org/T400068) (owner: 10Phuedx) [15:17:05] (03CR) 10CDobbins: [C:03+2] admin: add mahmoud-abdelsattar to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187080 (https://phabricator.wikimedia.org/T403695) (owner: 10CDobbins) [15:18:50] (03PS1) 10Gergő Tisza: Allow creating new WebAuthn passkeys on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187476 (https://phabricator.wikimedia.org/T378402) [15:20:33] (03CR) 10Ssingh: [C:03+2] service.yaml: add proxoid low-traffic service [puppet] - 10https://gerrit.wikimedia.org/r/1187459 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:20:48] ChrisDobbins901_: merging your change as well [15:20:56] ok to do so? [15:21:22] yes! I was just re-reading the Wikitech instructions [15:21:28] thanks [15:21:36] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on gitlab1003.wikimedia.org with reason: Upgrade [15:21:43] just checking with the other person and typing "multiple" [15:22:01] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on gitlab2002.wikimedia.org with reason: Upgrade [15:28:28] (03PS2) 10Ssingh: hiera: O:url_downloader: set LVS realserver pools [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) [15:29:25] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1187472/6891/urldownloader1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:30:23] (03CR) 10CDanis: [C:03+1] hiera: O:url_downloader: set LVS realserver pools [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:30:33] 
(03CR) 10Ssingh: [C:03+2] hiera: O:url_downloader: set LVS realserver pools [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:31:21] !log sudo cumin "O:url_downloader" "run-puppet-agent": T403416 [15:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bookworm [15:32:12] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): [QA Task] Verify Android compatability with removal of m. subdomain on test wiki - https://phabricator.wikimedia.org/T404342#11173394 (10Krinkle) a:05402998→03None [15:33:01] (03PS1) 10Ssingh: proxoid: move service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1187479 (https://phabricator.wikimedia.org/T403416) [15:34:12] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6892/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187479 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:35:01] !log sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding new service proxoid"': T403416 [15:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:15] (03CR) 10Hnowlan: [C:03+1] proxoid: move service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1187479 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:35:33] (03CR) 10Ssingh: [V:03+1] "Thanks Hugh" [puppet] - 10https://gerrit.wikimedia.org/r/1187479 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:35:34] (03CR) 10Ssingh: [V:03+1 C:03+2] proxoid: move service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1187479 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:37:57] !log lvs1020: restart pybal to test proxoid 
service [15:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:24] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1233.eqiad.wmnet with reason: host reimage [15:39:07] !log restarting pybal on lvs201[34], lvs1016 for proxoid change [15:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:32] !log sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'run-puppet-agent --enable "adding new service proxoid"' [15:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:35] (03PS1) 10Ssingh: proxoid: set LVS state to production [puppet] - 10https://gerrit.wikimedia.org/r/1187484 (https://phabricator.wikimedia.org/T403416) [15:42:01] !log lvs1019: restart pybal to test proxoid service [15:42:02] (03PS1) 10Elukey: Upgrade services using prometheus-statsd-exporter 0.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187485 (https://phabricator.wikimedia.org/T404368) [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:18] !log lvs201[34]: restart pybal to test proxoid service [15:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:36] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:41] cool [15:43:44] (03CR) 10Ssingh: [C:03+2] proxoid: set LVS state to production [puppet] - 10https://gerrit.wikimedia.org/r/1187484 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:44:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1233.eqiad.wmnet with reason: host reimage [15:44:58] !log sudo cumin 'A:dnsbox' run-puppet-agent [15:45:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:35] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:21] stevemunene: ^ [15:46:31] any idea who was working on these? [15:46:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:59] {"dse-k8s-worker2002.codfw.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=codfw,cluster=dse-k8s,service=kubesvc"} [15:47:24] (03PS8) 10Jcrespo: bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:47:40] btullis@cumin1003 reimage (PID 2117294) is awaiting input [15:48:38] sukhe: No production works going on in the cluster just yet but having a look [15:48:43] thanks [15:49:00] (03CR) 10Elukey: [C:03+1] Make maps2012-2014 replica nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1187457 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:50:10] (03CR) 10Effie Mouzeli: [C:03+1] wmnet: add proxoid A/A service records [dns] - 10https://gerrit.wikimedia.org/r/1187475 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:50:18] (03CR) 10Ssingh: [C:03+2] wmnet: add proxoid A/A service records [dns] - 10https://gerrit.wikimedia.org/r/1187475 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [15:50:34] (03CR) 10Jcrespo: [C:03+2] bacula::client: On Trixie hosts install the FD from component/bacula9 [puppet] - 
10https://gerrit.wikimedia.org/r/1187020 (https://phabricator.wikimedia.org/T404114) (owner: 10Muehlenhoff) [15:50:44] (03CR) 10JMeybohm: [C:04-2] "There is a default for this in hiera `profile::kubernetes::deployment_server::general`." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187485 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [15:50:45] !log sukhe@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=proxoid [15:51:13] !log sukhe@dns1004 START - running authdns-update [15:51:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:52:23] !log sukhe@dns1004 END - running authdns-update [15:53:12] (03CR) 10Elukey: "Okok forgot about it, a -2 seems a bit too much :D I'll abandon the change, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187485 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [15:53:27] (03Abandoned) 10Elukey: Upgrade services using prometheus-statsd-exporter 0.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187485 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [15:53:58] FIRING: [3x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:03] sukhe: Alerted the team as well in case there is anything [15:54:06] thanks [15:54:13] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:55:00] (03CR) 10JMeybohm: "And I thought this is the perfect occasion to yell "Do not submit" 😂" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187485 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [15:55:51] 
(03PS1) 10Elukey: role::deployment_server::kubernetes: upgrade the default statsd image [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) [15:58:02] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [15:58:58] FIRING: [5x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:03] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [16:00:05] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:05] jasmine_, swfrench-wmf, and hnowlan: OwO what's this, a deployment window?? Southward Datacenter Switchover [Livetest]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1600). 
nyaa~ [16:00:23] o/ [16:01:52] history [16:02:10] o> [16:03:13] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [16:04:13] !log jhancock@cumin1002 START - Cookbook sre.dns.netbox [16:06:19] (03PS1) 10Jcrespo: bacula: Fix repo configuration for bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187490 (https://phabricator.wikimedia.org/T404114) [16:08:03] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [16:08:37] (03CR) 10Majavah: [C:03+1] bacula: Fix repo configuration for bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187490 (https://phabricator.wikimedia.org/T404114) (owner: 10Jcrespo) [16:08:51] (03CR) 10Dzahn: [C:03+1] "Error: Component 'thirdparty/bacula9' as given to --component is not know." [puppet] - 10https://gerrit.wikimedia.org/r/1187490 (https://phabricator.wikimedia.org/T404114) (owner: 10Jcrespo) [16:09:09] (03CR) 10Jcrespo: [C:03+2] bacula: Fix repo configuration for bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/1187490 (https://phabricator.wikimedia.org/T404114) (owner: 10Jcrespo) [16:10:34] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-ctrl2006 to codfw - jhancock@cumin1002" [16:10:39] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-ctrl2006 to codfw - jhancock@cumin1002" [16:10:39] !log jhancock@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:33] !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-ctrl2006 [16:12:42] !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl2006 [16:13:04] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:13:44] (03PS1) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187491 [16:13:58] FIRING: [5x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:08] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:14:25] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:14:26] (03CR) 10Elukey: [C:03+2] role::deployment_server::kubernetes: upgrade the default statsd image [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [16:14:35] (03CR) 10Elukey: role::deployment_server::kubernetes: upgrade the default statsd image [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [16:15:25] (03PS2) 10Elukey: role::deployment_server::kubernetes: upgrade the default statsd image [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) [16:15:32] (03CR) 10Elukey: role::deployment_server::kubernetes: upgrade the default statsd image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: 10Elukey) [16:15:45] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:16:02] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:17:28] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:19:57] We're about to start the DC switchover live test - please refrain from making any major changes and especially deploys! [16:19:59] 06SRE, 10bacula, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11173651 (10jcrespo) Configuration looks as intended: `lines=10 root@people1005:/etc/apt/sources.list.d$ cat co... [16:20:41] (03PS1) 10Jcrespo: Revert "bacula: Ignore backup failures from people1005 & people2004" [puppet] - 10https://gerrit.wikimedia.org/r/1187492 (https://phabricator.wikimedia.org/T404114) [16:20:59] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:21:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 5.968 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:39] (03CR) 10Dzahn: [C:03+2] Revert "bacula: Ignore backup failures from people1005 & people2004" [puppet] - 10https://gerrit.wikimedia.org/r/1187492 (https://phabricator.wikimedia.org/T404114) (owner: 10Jcrespo) [16:24:47] jhancock@cumin1002 provision (PID 2017869) is awaiting input [16:25:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:25:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11173683 (10Jhancock.wm) [16:27:05] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:28:22] 06SRE, 10bacula, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: Trixie bacula-fd package incompatible with our bacula installation - https://phabricator.wikimedia.org/T404114#11173692 (10jcrespo) 05Open→03Resolved a:03jcrespo [16:31:43] (03CR) 10Dzahn: [C:03+2] zuul::executor: systctl setting unprivileged_userns_clone needed [puppet] - 10https://gerrit.wikimedia.org/r/1187055 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [16:35:58] (03CR) 10Dzahn: [C:03+2] "puppet error - duplicate declaration - because it's already defined in the base module - working on a fix" [puppet] - 10https://gerrit.wikimedia.org/r/1187055 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [16:36:11] (03CR) 10DCausse: [C:03+1] cirrus: Start AB test of did-you-mean profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858) (owner: 10Ebernhardson) [16:36:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:37:40] (03PS1) 10Btullis: Revert "Add four new (renamed) an-worker nodes to the Hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1187494 [16:38:28] (03CR) 10Btullis: [C:03+2] Revert "Add four new (renamed) an-worker nodes to the Hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1187494 (owner: 10Btullis) [16:41:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.942 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:38] (03CR) 10Vgutierrez: hcaptcha: define nginx timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187471 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [16:43:58] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:13] (03PS1) 10Dzahn: zuul::executor: use sysctl setting from base module, remove local code [puppet] - 10https://gerrit.wikimedia.org/r/1187495 (https://phabricator.wikimedia.org/T403847) [16:45:36] (03CR) 10CDanis: [C:03+2] turnilo: re-add summed-up TTFB measure [puppet] - 10https://gerrit.wikimedia.org/r/1187048 (owner: 10CDanis) [16:46:18] (03CR) 10Dzahn: [C:03+2] "just need to use existing hiera key instead! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187495" [puppet] - 10https://gerrit.wikimedia.org/r/1187055 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [16:48:11] (03CR) 10BryanDavis: "Cause of T404379 in Beta Cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [16:52:35] (03CR) 10Dzahn: [C:03+2] zuul::executor: use sysctl setting from base module, remove local code [puppet] - 10https://gerrit.wikimedia.org/r/1187495 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [16:55:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:56:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:59:07] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 
10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11173818 (10CDobbins) @mahmoud.abdelsattar.wmde you should be all set up. If not, please let me know and I'll fix... [16:59:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) (owner: 10Daimona Eaytoy) [16:59:54] btullis@cumin1003 reimage (PID 2117294) is awaiting input [17:00:04] jasmine_, swfrench-wmf, and hnowlan: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Southward Datacenter Switchover [Livetest]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1600). [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1700) [17:00:23] (03PS3) 10Daimona Eaytoy: Configure high-risk countries for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) [17:00:26] * bd808 looks to see what might be ready [17:01:27] (03CR) 10Ssingh: [C:03+2] "Thanks for letting me know. I will address it once I finish the main rollout." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187472 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [17:01:49] (03PS1) 10Majavah: dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) [17:01:52] (03PS1) 10Majavah: P:openstack: pdns::recursor: Log queries in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1187498 (https://phabricator.wikimedia.org/T404373) [17:03:45] (03PS1) 10Ssingh: P:haptcha: set PKI for proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) [17:03:58] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [17:04:31] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6893/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187499 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [17:05:03] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:10] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:05:32] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6894/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:06:12] (03PS2) 10Majavah: dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) 
[17:06:12] (03PS2) 10Majavah: P:openstack: pdns::recursor: Log queries in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1187498 (https://phabricator.wikimedia.org/T404373) [17:08:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6895/console" [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:13:13] no developer-portal build to push out this week. [17:13:13] (03PS2) 10Effie Mouzeli: hcaptcha: define timeouts for hcaptcha [puppet] - 10https://gerrit.wikimedia.org/r/1187471 (https://phabricator.wikimedia.org/T403416) [17:13:48] (03PS3) 10Majavah: dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) [17:13:48] (03PS3) 10Majavah: P:openstack: pdns::recursor: Log queries in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1187498 (https://phabricator.wikimedia.org/T404373) [17:14:03] (03CR) 10Ssingh: [C:03+1] "Looks good from the prod DNS perspective; leaving the decision reasoning for WMCS in T404373 for your judgement. Thanks for running PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:14:50] (03PS4) 10Majavah: dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) [17:14:50] (03PS4) 10Majavah: P:openstack: pdns::recursor: Log queries in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1187498 (https://phabricator.wikimedia.org/T404373) [17:14:58] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:51] (03CR) 10Majavah: dnsrecursor: Add an option to log queries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:15:52] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:16:46] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6897/console" [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:16:52] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:17] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:20:35] (03PS2) 10CDobbins: admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) [17:20:39] (03PS1) 10Ssingh: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) [17:20:43] !log 
jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:21:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6898/co" [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [17:22:24] (03CR) 10Ssingh: [V:03+1] "Please see Depends-On before reviewing this." [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: 10Ssingh) [17:24:00] (03CR) 10CDobbins: [C:03+2] admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) (owner: 10CDobbins) [17:25:43] (03CR) 10Ssingh: [C:03+1] dnsrecursor: Add an option to log queries [puppet] - 10https://gerrit.wikimedia.org/r/1187497 (https://phabricator.wikimedia.org/T404373) (owner: 10Majavah) [17:26:16] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [17:26:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11173884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [17:26:31] (03CR) 10Ssingh: [C:03+1] admin: add johannesrichterwmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) (owner: 10CDobbins) [17:30:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:58] FIRING: [2x] 
SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:36:28] (03CR) 10Dzahn: [C:03+1] "lgtm. just one thing. it seems like they have not been added to the "NDA and MOU" spreadsheet yet and only legal can write to it. maybe yo" [puppet] - 10https://gerrit.wikimedia.org/r/1187094 (https://phabricator.wikimedia.org/T404080) (owner: 10CDobbins) [17:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:02] (03CR) 10Kimberly Sarabia: [C:04-1] ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403255) (owner: 10Marco Fossati) [17:41:33] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:17] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:09] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11173916 (10CDobbins) @KFrancis: could you add @mahmoud.abdelsattar.wmde to the NDA spreadsheet? I wasn't able to... 
[17:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:03] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11173931 (10CDobbins) @KFrancis: could you add @Johannes_Richter_WMDE to the NDA spreadsheet? I wasn't able to find that username, so if I'm mistaken, my apolo... [17:49:20] 10ops-eqsin, 06SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#11173932 (10RobH) Awesome. It likely isn't worth a stand alone ticket for removal and disposal, so instead I'm going to keep this open until I do the following: * update notes of object in netbox with link t... [17:50:07] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.400 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:19] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11173939 (10CDobbins) @Johannes_Richter_WMDE you should be set up now. If there's something I missed, please reply and I'll fix it ASAP! 
[17:54:34] !log jhancock@cumin1002 START - Cookbook sre.dns.netbox [17:54:36] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:54:56] (03PS3) 10Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - 10https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) [17:55:59] 10ops-eqsin, 06SRE: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#11173941 (10RobH) 05Open→03Resolved notes for device now include: Anchor offline and power port powered down per T382519. No sensitive data on device, can be disposed of in next recycling. set device... [17:56:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:57:13] !log jhancock@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:39] (03CR) 10Brouberol: [C:03+2] dumps: disable rsync access for 2 dead dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1187016 (https://phabricator.wikimedia.org/T402987) (owner: 10Xcollazo) [17:59:02] sukhe I see you +1ed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187094. I'm going to merge it alongside my patch [17:59:10] !log jhancock@cumin1002 START - Cookbook sre.dns.netbox [17:59:29] brouberol: ah sure. ChrisDobbins901_ ^ [17:59:43] all merged [18:00:00] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T403695#11173947 (10KFrancis) Done, apologies for the delay! [18:00:05] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T1800). 
nyaa~ [18:02:48] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-worker2003 to codfw - jhancock@cumin1002" [18:03:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:57] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:04:13] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187505 (https://phabricator.wikimedia.org/T396379) [18:04:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187505 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:04:25] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11173976 (10KFrancis) Done! 
[18:04:56] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dse-k8s-worker2003 to codfw - jhancock@cumin1002" [18:04:56] !log jhancock@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:09] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187505 (https://phabricator.wikimedia.org/T396379) (owner: 10TrainBranchBot) [18:05:09] !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2003 [18:05:18] !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2003 [18:05:52] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:06:23] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:13:28] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [18:13:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11173998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... 
[18:14:51] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.18 refs T396379 [18:14:55] T396379: 1.45.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T396379 [18:17:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1056.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:20] vriley@cumin1003 reimage (PID 2138902) is awaiting input [18:21:18] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [18:21:25] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11174010 (10jsn.sherman) [18:21:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11174011 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [18:21:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bookworm [18:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:40:42] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [18:41:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.923 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:42:38] RECOVERY - OSPF 
status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:42:56] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [18:54:13] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11174099 (10CDobbins) 05Open→03In progress [18:54:15] (03CR) 10Vgutierrez: hcaptcha: define timeouts for hcaptcha (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187471 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [18:57:21] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11174103 (10CDobbins) From what I can tell by checking [[ https://phabricator.wikimedia.org/project/members/61/ | WMF-NDA group membership ]] and [[ https://docs.google.com/spreadsheets/d... [19:01:34] vriley@cumin1003 reimage (PID 2139472) is awaiting input [19:01:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bookworm [19:05:17] (03PS2) 10Ssingh: P:trafficserver: switch hcaptcha to proxoid.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) [19:06:00] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11174138 (10CDobbins) @ECohen_WMDE, how do you want to do the public key confirmation? One of the most common methods is to put your pubkey on your Mediawiki page, usually in the Contact... 
[19:06:02] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6899/co" [puppet] - https://gerrit.wikimedia.org/r/1187502 (https://phabricator.wikimedia.org/T403416) (owner: Ssingh)
[19:09:38] (CR) Ssingh: "Don't want to duplicate the effort but there is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1187502, which I wasn't aware this wa" [puppet] - https://gerrit.wikimedia.org/r/1187471 (https://phabricator.wikimedia.org/T403416) (owner: Effie Mouzeli)
[19:10:17] (PS2) Bking: opensearch-operator: point to correct image [deployment-charts] - https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246)
[19:12:48] (CR) Bking: "Thanks, fixed." [deployment-charts] - https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[19:12:58] (CR) Bking: [C:+2] opensearch-operator: point to correct image [deployment-charts] - https://gerrit.wikimedia.org/r/1187109 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[19:14:04] (PS1) Scott French: Configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1184935 (https://phabricator.wikimedia.org/T403657)
[19:16:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[19:19:52] (CR) Krinkle: [C:+1] Configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1184935 (https://phabricator.wikimedia.org/T403657) (owner: Scott French)
[19:23:45] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11174198 (KFrancis) I don't have a record of an NDA on file for ECohen_WMDE (Elisha Cohen). Please send an email address for Elisha and I will process an NDA. Thanks!
[19:30:43] SRE, Data-Engineering, Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11174218 (CDanis) Thank you @JAllemandou ! This doesn't look too hard to implement in bloblang -- thankfully the definition of [[ htt...
[19:34:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[19:36:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:37:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:38:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[19:44:12] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 203635208 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:45:12] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:51:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:55:48] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[19:55:55] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11174274 (VRiley-WMF) Only server left is es1056, which is giving me a strange error. Looking into this with @Papaul Was informed to check to see if in BIOS all disks are bein...
[19:57:17] (PS1) C. Scott Ananian: Deploy Parsoid Read Views to 23 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1187521 (https://phabricator.wikimedia.org/T404390)
[19:58:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187521 (https://phabricator.wikimedia.org/T404390) (owner: C. Scott Ananian)
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T2000).
[20:00:05] sbassett, Daimona, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:27] i'm here
[20:00:30] here
[20:00:34] i can spiderpig
[20:00:38] o/
[20:01:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.813 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:02:31] (CR) Subramanya Sastry: [C:+1] Deploy Parsoid Read Views to 23 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1187521 (https://phabricator.wikimedia.org/T404390) (owner: C. Scott Ananian)
[20:02:54] sbassett: you're first on the list
[20:03:45] cscott: sure, should I just deploy via spiderpig? haven’t done a bp deploy in a while :) my patch should basically be a noop anyways, really just more to stage code on a single wmf version.
[20:04:40] SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11174299 (Andrew) ...and now it's 100% bookworm/reef
[20:05:52] spiderpig should work
[20:05:59] Ok, will run mine now
[20:06:08] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915) (owner: SBassett)
[20:06:23] is RoanKattouw urbanecm TheresNoTime kindrobot or cjming around if the pig goes awry (not that i expect it to)
[20:07:01] * sbassett can usually get ahold of Releng folks if disaster strikes
[20:07:18] (Merged) jenkins-bot: Optionally encrypt OTP secret in the database [extensions/OATHAuth] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187101 (https://phabricator.wikimedia.org/T145915) (owner: SBassett)
[20:07:18] good enough :)
[20:07:34] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1187101|Optionally encrypt OTP secret in the database (T145915)]]
[20:07:38] T145915: OATHAuth OTP shouldn't be stored in cleartext in the DB - https://phabricator.wikimedia.org/T145915
[20:09:57] (PS3) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866)
[20:10:42] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[20:11:37] K8s test server deployment was fairly slow…
[20:13:24] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1187101|Optionally encrypt OTP secret in the database (T145915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:13:28] T145915: OATHAuth OTP shouldn't be stored in cleartext in the DB - https://phabricator.wikimedia.org/T145915
[20:13:58] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:14:34] !log sbassett@deploy1003 sbassett: Continuing with sync
[20:20:13] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187101|Optionally encrypt OTP secret in the database (T145915)]] (duration: 12m 38s)
[20:20:17] T145915: OATHAuth OTP shouldn't be stored in cleartext in the DB - https://phabricator.wikimedia.org/T145915
[20:20:34] Daimona: you're next?
[20:20:47] Aye
[20:20:48] sbassett: looks like you're done?
[20:20:59] cscott: yop, all good. thanks.
[20:21:48] I'd need a hero to deploy my thing tho
[20:22:31] oh? it looks like a single config var addition
[20:22:56] Yeah and it's a no-op
[20:23:14] (as in: the code that reads it is not in production yet)
[20:23:18] i'm happy to spider pig it along with my config change if that works for you
[20:24:14] Sure!
[20:24:23] (CR) C. Scott Ananian: [C:+1] Configure high-risk countries for CampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) (owner: Daimona Eaytoy)
[20:24:59] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187521 (https://phabricator.wikimedia.org/T404390) (owner: C. Scott Ananian)
[20:24:59] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) (owner: Daimona Eaytoy)
[20:25:04] (PS4) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866)
[20:25:06] Thank you!
[20:25:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:25:16] no worries, zero additional time since i had a config change too
[20:26:09] (Merged) jenkins-bot: Deploy Parsoid Read Views to 23 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1187521 (https://phabricator.wikimedia.org/T404390) (owner: C. Scott Ananian)
[20:26:16] (Merged) jenkins-bot: Configure high-risk countries for CampaignEvents [mediawiki-config] - https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) (owner: Daimona Eaytoy)
[20:26:22] (CR) BBlack: varnish: Add "Vary: User-Agent" during delivery of pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[20:26:32] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1187521|Deploy Parsoid Read Views to 23 Wikipedias (T404390)]], [[gerrit:1183144|Configure high-risk countries for CampaignEvents (T402353)]]
[20:26:40] T404390: Parsoid Read Views to Wikipedia deploy 2025-09-11 - https://phabricator.wikimedia.org/T404390
[20:26:40] T402353: Organizer can toggle on collaborative contributions for qualified events (nice to have for MVP) - https://phabricator.wikimedia.org/T402353
[20:29:32] (CR) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[20:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:32:51] !log cscott@deploy1003 daimona, cscott: Backport for [[gerrit:1187521|Deploy Parsoid Read Views to 23 Wikipedias (T404390)]], [[gerrit:1183144|Configure high-risk countries for CampaignEvents (T402353)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:32:56] (PS1) Dzahn: zuul::executor: create systemd service [puppet] - https://gerrit.wikimedia.org/r/1187531 (https://phabricator.wikimedia.org/T403847)
[20:32:57] T404390: Parsoid Read Views to Wikipedia deploy 2025-09-11 - https://phabricator.wikimedia.org/T404390
[20:32:58] T402353: Organizer can toggle on collaborative contributions for qualified events (nice to have for MVP) - https://phabricator.wikimedia.org/T402353
[20:33:12] (CR) CI reject: [V:-1] zuul::executor: create systemd service [puppet] - https://gerrit.wikimedia.org/r/1187531 (https://phabricator.wikimedia.org/T403847) (owner: Dzahn)
[20:33:47] (PS2) Dzahn: zuul::executor: create systemd service [puppet] - https://gerrit.wikimedia.org/r/1187531 (https://phabricator.wikimedia.org/T403847)
[20:34:33] Daimona: I suppose there's nothing to test w/ your patch?
[20:35:20] !log cscott@deploy1003 daimona, cscott: Continuing with sync
[20:36:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:37:21] Yep
[20:40:35] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187521|Deploy Parsoid Read Views to 23 Wikipedias (T404390)]], [[gerrit:1183144|Configure high-risk countries for CampaignEvents (T402353)]] (duration: 14m 02s)
[20:40:40] T404390: Parsoid Read Views to Wikipedia deploy 2025-09-11 - https://phabricator.wikimedia.org/T404390
[20:40:41] T402353: Organizer can toggle on collaborative contributions for qualified events (nice to have for MVP) - https://phabricator.wikimedia.org/T402353
[20:43:55] done!
[20:43:58] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:30] (CR) BBlack: varnish: Add "Vary: User-Agent" during delivery of pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[20:48:08] (CR) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[20:50:50] (CR) Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[20:50:59] SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11174441 (Johannes_Richter_WMDE) Open→Resolved
[20:51:21] SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11174442 (Johannes_Richter_WMDE) Thanks, everything works as expected.
[20:51:41] (PS5) Krinkle: varnish: Add "Vary: User-Agent" during delivery of pageviews [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866)
[20:55:06] (CR) Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan)
[20:59:17] (CR) Krinkle: "Tests passing for me now [1]" [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250911T2100)
[21:02:21] FYI, if the Web Team doesn't have anything planned for the window today, I have a mediawiki-config patch I'd like to deploy
[21:02:31] * swfrench-wmf will wait the standard 5 minutes
[21:03:58] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[21:13:15] (CR) BCornwall: [V:+1 C:+1] "Looking good to me - I think it's too late in the day to merge now and tomorrow is friday, so let's shoot for monday?" [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[21:13:30] (CR) BBlack: [C:+1] "LGTM, I'd wait on brett and/or wait for a better time of day for people to be around and observe" [puppet] - https://gerrit.wikimedia.org/r/1187464 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[21:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:26:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.154 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:30:48] !log jforrester Deployed security patch for T404392
[21:33:58] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:34:51] jouncebot: next
[21:34:51] In 8 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250912T0600)
[21:37:39] * swfrench-wmf is proceeding with mediawiki-config change
[21:38:01] (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184935 (https://phabricator.wikimedia.org/T403657) (owner: Scott French)
[21:38:49] (Merged) jenkins-bot: Configure cookie-based enrollment in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1184935 (https://phabricator.wikimedia.org/T403657) (owner: Scott French)
[21:39:18] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1184935|Configure cookie-based enrollment in PHP 8.3 (T403657)]]
[21:39:23] T403657: Configure the WikimediaEvents extension for the PHP 8.3 migration - https://phabricator.wikimedia.org/T403657
[21:40:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:45:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:48:58] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:54:59] (PS1) Zabe: Add Apache configuration for Wikimedia Thailand wiki [puppet] - https://gerrit.wikimedia.org/r/1187539 (https://phabricator.wikimedia.org/T400001)
[21:59:22] (PS2) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730)
[22:04:06] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/namespaces on k8s-dse@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:04:56] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1184935|Configure cookie-based enrollment in PHP 8.3 (T403657)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:05:00] T403657: Configure the WikimediaEvents extension for the PHP 8.3 migration - https://phabricator.wikimedia.org/T403657
[22:05:26] * swfrench-wmf is testing
[22:08:18] !log swfrench@deploy1003 swfrench: Continuing with sync
[22:10:23] (PS1) Jclark-ctr: fix dse-k8s-worker1014 drives are not nvme and set to nvme [puppet] - https://gerrit.wikimedia.org/r/1187542 (https://phabricator.wikimedia.org/T399779)
[22:11:53] (CR) Jclark-ctr: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1187542 (https://phabricator.wikimedia.org/T399779) (owner: Jclark-ctr)
[22:13:42] (CR) Jclark-ctr: [C:+2] fix dse-k8s-worker1014 drives are not nvme and set to nvme [puppet] - https://gerrit.wikimedia.org/r/1187542 (https://phabricator.wikimedia.org/T399779) (owner: Jclark-ctr)
[22:14:11] (CR) RobH: [C:+2] fix dse-k8s-worker1014 drives are not nvme and set to nvme [puppet] - https://gerrit.wikimedia.org/r/1187542 (https://phabricator.wikimedia.org/T399779) (owner: Jclark-ctr)
[22:16:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[22:16:30] ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11174607 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[22:20:47] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184935|Configure cookie-based enrollment in PHP 8.3 (T403657)]] (duration: 41m 28s)
[22:20:51] T403657: Configure the WikimediaEvents extension for the PHP 8.3 migration - https://phabricator.wikimedia.org/T403657
[22:25:40] (PS1) Jasmine: switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891)
[22:27:49] (PS2) Jasmine: switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891)
[22:28:57] (CR) Andrea Denisse: "Hi folks, I tested this in the #api-alerts-test Slack channel. I sent alerts for the `mediawiki-engineering` and the `dog-owners` team to " [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730) (owner: Andrea Denisse)
[22:29:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:30:06] (PS3) Andrea Denisse: alert: Add Slack route to send Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/1184611 (https://phabricator.wikimedia.org/T401730)
[22:35:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:35:13] (CR) CI reject: [V:-1] switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891) (owner: Jasmine)
[22:37:28] (PS3) Jasmine: switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891)
[22:38:41] jclark@cumin1002 reimage (PID 2346817) is awaiting input
[22:43:15] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11174684 (Jclark-ctr) @bking i am struggling with getting this server to image and continue to get this error can you assist? {F66015226}
[22:46:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm
[22:46:44] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:46:55] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11174690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dse-k8s-worker1014.eqiad.wmnet with OS bookworm executed with errors: - dse-k8s-worker10...
[22:51:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:00:10] FIRING: BFDdown: BFD session down between cr3-eqsin and 103.102.166.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:01:44] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:05:10] RESOLVED: BFDdown: BFD session down between cr3-eqsin and 103.102.166.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:05:50] (PS11) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860)
[23:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:09:02] (PS1) Jforrester: Surface custom errors on ZObjectStringRenderer and FunctionInputParser fields [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187553 (https://phabricator.wikimedia.org/T395475)
[23:09:21] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187553 (https://phabricator.wikimedia.org/T395475) (owner: Jforrester)
[23:15:08] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[23:15:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.550 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:15:28] (Merged) jenkins-bot: Surface custom errors on ZObjectStringRenderer and FunctionInputParser fields [extensions/WikiLambda] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1187553 (https://phabricator.wikimedia.org/T395475) (owner: Jforrester)
[23:15:48] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1187553|Surface custom errors on ZObjectStringRenderer and FunctionInputParser fields (T395475)]]
[23:15:52] T395475: Usability: Unclear input format and error handling for the "Age" function hinder successful use - https://phabricator.wikimedia.org/T395475
[23:20:00] (PS12) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860)
[23:21:19] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1187553|Surface custom errors on ZObjectStringRenderer and FunctionInputParser fields (T395475)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:21:25] T395475: Usability: Unclear input format and error handling for the "Age" function hinder successful use - https://phabricator.wikimedia.org/T395475
[23:21:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:21:36] (CR) Scott French: [C:+1] "Thanks for catching this!" [puppet] - https://gerrit.wikimedia.org/r/1187489 (https://phabricator.wikimedia.org/T404368) (owner: Elukey)
[23:21:41] !log jforrester@deploy1003 jforrester: Continuing with sync
[23:27:57] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[23:28:49] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187553|Surface custom errors on ZObjectStringRenderer and FunctionInputParser fields (T395475)]] (duration: 13m 01s)
[23:28:54] T395475: Usability: Unclear input format and error handling for the "Age" function hinder successful use - https://phabricator.wikimedia.org/T395475
[23:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187566
[23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187566 (owner: TrainBranchBot)
[23:53:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1187566 (owner: TrainBranchBot)