[00:10:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:15:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:37:31] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283
[00:37:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283 (owner: TrainBranchBot)
[00:47:57] (PS4) Scott French: hieradata: pilot cfssl/pki for etcd on conf2006 [puppet] - https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245)
[00:48:15] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[00:51:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283 (owner: TrainBranchBot)
[00:54:21] (PS1) Scott French: deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955)
[00:54:23] (PS1) Scott French: mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955)
[01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:07:19] (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389)
[01:07:38] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287
[01:07:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287 (owner: TrainBranchBot)
[01:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:14:58] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 17s)
[01:29:47] (CR) STran: [C:+2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389) (owner: STran)
[01:30:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287 (owner: TrainBranchBot)
[01:31:51] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389) (owner: STran)
[01:33:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:49] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[01:38:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:52] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[01:41:43] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[01:42:12] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[01:42:15] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[01:42:40] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[01:52:55] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:18:22] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:52:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:57:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:58:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:26:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:37:12] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:04:18] SRE, Gerrit, observability, Release-Engineering-Team (Radar): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086#11357128 (Pppery)
[05:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:13:30] SRE, Traffic-Icebox, Patch-Needs-Improvement: Preserve Server response header when generating custom error page via VCL - https://phabricator.wikimedia.org/T285926#11357159 (Pppery)
[05:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:06] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:49:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:52:55] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:03:38] (CR) Giuseppe Lavagetto: [C:+1] deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[06:18:28] (CR) Giuseppe Lavagetto: [C:+1] mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[06:31:46] (PS2) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[06:32:24] (CR) CI reject: [V:-1] cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:34:47] (PS3) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[06:39:43] (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:51:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:27:30] (PS1) KartikMistry: apertium: staging: Update to 2025-11-10-034557-production [deployment-charts] - https://gerrit.wikimedia.org/r/1203296 (https://phabricator.wikimedia.org/T408515)
[07:31:02] (PS2) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498)
[07:31:07] (PS3) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498)
[07:32:02] (PS4) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[07:32:02] (PS1) Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555)
[07:32:58] (CR) CI reject: [V:-1] cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[07:33:25] (CR) CI reject: [V:-1] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[07:33:25] (CR) CI reject: [V:-1] rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[07:37:12] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:58:33] (PS1) Muehlenhoff: Remove access for stevemunene [puppet] - https://gerrit.wikimedia.org/r/1203299
[07:59:57] (CR) Vgutierrez: [C:+1] cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:03:53] (PS1) Slyngshede: IDP: Switch to CAS 7.2 [dns] - https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455)
[08:04:01] (CR) Muehlenhoff: [C:+2] Remove access for stevemunene [puppet] - https://gerrit.wikimedia.org/r/1203299 (owner: Muehlenhoff)
[08:07:43] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Stevemunene out of all services on: 2395 hosts
[08:19:34] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:19:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:21:28] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:24:40] (PS1) Muehlenhoff: Remove Steve from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1203375
[08:29:27] (CR) Vgutierrez: [C:-1] cache::text: introduce rate-limits by traffic class (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[08:42:50] (PS1) Vgutierrez: external_clouds_vendors: Add CCBot [puppet] - https://gerrit.wikimedia.org/r/1203377
[08:44:25] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11357368 (Krd) The bounces queue is at 292k now, and increasing. Please have a look.
[08:45:12] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11357370 (Krd) {F70070832}
[08:45:13] (PS1) Brouberol: airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711)
[08:45:15] (PS1) Brouberol: airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711)
[08:47:49] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1203375 (owner: Muehlenhoff)
[08:48:02] !log uploaded openjdk-8 8u472-ga-1~deb12u1 to apt.wikimedia.org (forward port of latest Java 8 security updates)
[08:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] !log installing Java 8 security updates on Bookworm
[08:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:18] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:52:33] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11357385 (MoritzMuehlenhoff)
[08:52:35] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1203377 (owner: Vgutierrez)
[08:52:51] (CR) Vgutierrez: [C:+2] external_clouds_vendors: Add CCBot [puppet] - https://gerrit.wikimedia.org/r/1203377 (owner: Vgutierrez)
[08:53:51] (CR) David Caro: [C:+1] hieradata: Remove obsolete haproxy_exporter settings [puppet] - https://gerrit.wikimedia.org/r/1202997 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[08:54:20] (CR) Majavah: [C:+2] hieradata: Remove obsolete haproxy_exporter settings [puppet] - https://gerrit.wikimedia.org/r/1202997 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[08:55:17] (CR) Clément Goubert: [C:+2] Route transform/wikitext/to/lint(.*) to the gateway on group1 [puppet] - https://gerrit.wikimedia.org/r/1194995 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz)
[09:00:38] (CR) Muehlenhoff: [C:+2] Remove Steve from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1203375 (owner: Muehlenhoff)
[09:00:39] (CR) Clément Goubert: [C:+2] Route /page/lint(.*) to the gateway on group0 [puppet] - https://gerrit.wikimedia.org/r/1199033 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[09:01:26] moritzm: ok to merge?
[09:03:48] (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:05:07] (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:05:50] moritzm: Assuming I can merge since it's just a user right change
[09:06:07] claime: yes, please. I had been attempting to merge, but the lock was held by you
[09:06:09] (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:06:23] moritzm: hah! Merged :D
[09:06:27] thanks!
[09:08:48] (PS4) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:10:55] (PS5) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:12:21] SRE, Infrastructure-Foundations, netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690 (fgiunchedi) NEW
[09:13:15] (CR) Brouberol: [C:+2] dse-k8s-eqiad: enable ceph-csi-cephfs in the growthbook namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:13:23] (CR) Brouberol: [C:+2] airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:13:25] (CR) Brouberol: [C:+2] airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:13:52] (CR) Gehel: [C:-1] "minor comment on documentation" [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:15:03] SRE, Infrastructure-Foundations, netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11357477 (fgiunchedi) >>! In T399180#11310972, @cmooney wrote: >>>! In T399180#11310845, @fgiunchedi wrote: >> @taavi @Andrew @cmooney what do you think of the above? >...
[09:15:49] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11357479 (jcrespo) Yes, please, dc ops, file a servicing request or help us with a spare here.
[09:15:50] (Merged) jenkins-bot: airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:16:09] (Merged) jenkins-bot: airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:17:42] (CR) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:17:44] (PS6) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:18:14] (PS1) Filippo Giunchedi: cloudcephosd: switch 1048 to single interface [puppet] - https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180)
[09:18:16] (PS1) Filippo Giunchedi: cloudcephosd: switch 1049 to single interface [puppet] - https://gerrit.wikimedia.org/r/1203384 (https://phabricator.wikimedia.org/T399180)
[09:22:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:23:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:25:51] (CR) Muehlenhoff: [C:+1] "My tests on idp1005 were all fine" [dns] - https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[09:26:17] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11357512 (jcrespo) For Robh, a bit of a background on the requirements for production dbs, from a backup perspective, so he has the global undestanding of our aim. Databases rar...
[09:27:44] (CR) Muehlenhoff: [C:+2] Remove historic comments [deployment-charts] - https://gerrit.wikimedia.org/r/1202127 (owner: Muehlenhoff)
[09:29:20] (CR) Btullis: [C:+1] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:30:36] (CR) Muehlenhoff: [C:+2] osm_master: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:32:59] (PS1) Brouberol: airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711)
[09:33:28] (CR) Btullis: [C:+1] growthbook: define configuration for local file uploads [deployment-charts] - https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: Brouberol)
[09:33:51] (PS2) Muehlenhoff: osm_replica: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565)
[09:34:01] (CR) Btullis: [C:+1] airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:34:24] (CR) Btullis: [C:+1] airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:34:51] (CR) Btullis: [C:+1] dse-k8s-eqiad: enable ceph-csi-cephfs in the growthbook namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:35:43] (CR) Brouberol: [C:+2] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:36:50] (CR) Btullis: [C:+1] airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:37:35] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:39:02] (CR) Brouberol: [C:+2] airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:39:06] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:07] (CR) Brouberol: [C:+2] growthbook: define configuration for local file uploads [deployment-charts] - https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: Brouberol)
[09:48:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:49:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:50:14] (PS1) Muehlenhoff: Fix cumin alias for maps [puppet] - https://gerrit.wikimedia.org/r/1203390 (https://phabricator.wikimedia.org/T381565)
[09:50:20] (CR) Clément Goubert: [C:-1] "The CI fails because of this error:" [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[09:52:55] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:57:24] (PS1) Dpogorzelski: aya-llm: fix tolerations and affinity [deployment-charts] - https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697)
[09:58:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:59:26] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:59:34] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:59:45] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:59:56] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:01:57] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:02:37] (CR) Slyngshede: [C:+2] IDP: Switch to CAS 7.2 [dns] - https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[10:03:17] !log Upgrade CAS (idp.wikimedia.org) to version 7.2.7
[10:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:43] !log slyngshede@dns1004 START - running authdns-update
[10:04:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:04:38] !log slyngshede@dns1004 END - running authdns-update
[10:04:44] (PS1) Majavah: hieradata: Enable jumbo frames on all codwf1dev cloudvirts [puppet] - https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075)
[10:04:46] (PS1) Majavah: hieradata: Enable jumbo frames on codfw1dev cloudnets [puppet] - https://gerrit.wikimedia.org/r/1203399 (https://phabricator.wikimedia.org/T330075)
[10:04:48] (PS1) Majavah: hieradata: Enable jumbo frames on remaining codfw1dev nodes [puppet] - https://gerrit.wikimedia.org/r/1203400 (https://phabricator.wikimedia.org/T330075)
[10:06:57] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:08:00] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075) (owner: Majavah)
[10:08:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[10:08:47] (CR) Majavah: [V:+1 C:+2] hieradata: Enable jumbo frames on all codwf1dev cloudvirts [puppet] - https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075) (owner: Majavah)
[10:09:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:10:18] (PS4) Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264)
[10:10:41] (CR) Elukey: [C:+1] aya-llm: fix tolerations and affinity (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: Dpogorzelski)
[10:11:59] (PS2) Dpogorzelski: ml-services: fix tolerations and affinity for aya-llm [deployment-charts] - https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697)
[10:12:21] (CR) Dpogorzelski: [C:+2] ml-services: fix tolerations and affinity for aya-llm (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: Dpogorzelski)
[10:14:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:14:14] (03Merged) 10jenkins-bot: ml-services: fix tolerations and affinity for aya-llm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [10:15:19] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707 (10Chandra-WMDE) 03NEW [10:16:19] (03CR) 10Clément Goubert: [C:04-1] Note that per-route rate limits require Envoy 1.33 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler) [10:16:57] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:22] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:23:22] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:37] (03CR) 10Vgutierrez: "code looks good, please fix the indentation issues mentioned in the inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [10:24:06] 
06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11357869 (10AndrewTavis_WMDE) [10:26:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:28:22] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:33:35] (03PS1) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401 [10:34:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:38:05] (03PS1) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) [10:44:25] (03PS2) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) [10:49:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7591/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [10:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is 
about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:51:13] (03CR) 10Clément Goubert: "Yeah that's half a worker per pod. Checking thanos https://w.wiki/G23r afaict right now there would only be 3 workers, all in `eqiad`, tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [10:52:38] (03PS1) 10Tiziano Fogli: metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003) [10:52:45] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:52:47] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:58:55] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11357931 (10Nahid) 05Resolved→03Open Hey all - Thanks for attending this task. I am re-opening the task but please let me know if it needs a new ticket. Sarah is having... [10:59:42] (03PS1) 10Muehlenhoff: Record LDAP access for khernandez [puppet] - 10https://gerrit.wikimedia.org/r/1203407 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1100) [11:00:17] (03CR) 10Clément Goubert: "Yep that's exactly that." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [11:00:38] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for khernandez [puppet] - 10https://gerrit.wikimedia.org/r/1203407 (owner: 10Muehlenhoff) [11:04:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [11:06:07] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: define catch-all rate limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) (owner: 10Daniel Kinzler) [11:08:25] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [11:08:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [11:09:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [11:10:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [11:10:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11357966 (10MoritzMuehlenhoff) @RobH ganeti1024 and ganeti1033 are drained and can be migrated. 
[11:12:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11357969 (10MoritzMuehlenhoff) [11:14:14] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712 (10jcrespo) 03NEW [11:18:04] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11357982 (10jcrespo) [11:20:01] (03PS1) 10Silvan Heintze: Report integrity metric from wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) [11:20:01] (03CR) 10Silvan Heintze: "As discussed: for this to work, an added network policy is needed to allow access from the Kubernetes pods to the push gateway." [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [11:21:14] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11357987 (10jcrespo) [11:21:57] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:22] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: (Re-)Add monitoring for the internal Ganeti certs - https://phabricator.wikimedia.org/T382902#11358007 (10MoritzMuehlenhoff) 05Open→03Resolved I've added a new 
Prometheus exporter to all Ganeti nodes (which only runs on the masters), which detects the re... [11:24:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:25:09] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) [11:25:32] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358012 (10jcrespo) It shows up also here: {F70074799} Maybe it is expected? [11:26:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:26:37] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:26:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:26:57] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:06] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - 
ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:27:35] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:28:15] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358013 (10jcrespo) The found workaround: https://alerts.wikimedia.org/?q=%40silenced_by%3D6c0e20b0-632b-4410-be33-32f631f020a5 [11:28:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:29:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:29:03] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358016 (10Volans) Possibly related to T328869 [11:31:57] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:37:20] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [11:38:17] (03PS2) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401 [11:38:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358084 (10Jclark-ctr) @jcrespo can this be swapped at any time or do we need to schedule? [11:39:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:40:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358086 (10jcrespo) Go ahead if it doesn't require shutdown. If it requires or it is preferred, just let me know and I will perform it myself right now, will... 
[11:41:57] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:46:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:48:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2010:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:49:02] RESOLVED: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:51:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [11:53:14] (03PS1) 10Vgutierrez: varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 [11:53:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2010:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:54:31] (03PS2) 10Vgutierrez: varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 [11:56:43] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358164 (10jcrespo) A yes, I couldn't find it before. I will merge it there. 
[11:56:54] (03CR) 10Slyngshede: [C:03+1] varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez) [11:57:10] 06SRE, 10Observability-Alerting: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869#11358168 (10jcrespo) [11:57:12] 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358171 (10jcrespo) →14Duplicate dup:03T328869 [12:00:34] (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy on both text & upload" [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez) [12:00:58] (03PS1) 10STran: Deploy temporary accounts to more large projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) [12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.185s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:45] (03PS2) 10STran: Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) [12:12:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:12:22] (03PS1) 
10Brouberol: global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) [12:14:50] (03PS1) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) [12:22:29] (03CR) 10Btullis: global_config: define a prometheus external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:22:42] (03PS1) 10Majavah: P:toolfroge::elasticsearch::haproxy: Use firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1203425 [12:22:42] (03PS1) 10Majavah: P:toolfroge::elasticsearch::haproxy: Enable native Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1203426 (https://phabricator.wikimedia.org/T343885) [12:22:44] (03PS1) 10Majavah: P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) [12:22:46] (03PS1) 10Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 [12:23:32] (03PS2) 10Majavah: P:toolforge::elasticsearch::haproxy: Use firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1203425 [12:23:32] (03PS2) 10Majavah: P:toolforge::elasticsearch::haproxy: Enable native Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1203426 (https://phabricator.wikimedia.org/T343885) [12:23:32] (03PS2) 10Majavah: P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) [12:23:32] (03PS2) 10Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 [12:24:13] (03PS2) 10Brouberol: global_config: define a prometheus external service [puppet] - 
10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) [12:24:21] (03CR) 10Btullis: airflow-test-k8s: allow task-pod -> prometheus gateway egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:24:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7598/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203425 (owner: 10Majavah) [12:24:26] FIRING: InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 4652 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [12:24:28] (03PS2) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) [12:24:45] (03CR) 10Brouberol: global_config: define a prometheus external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:25:01] (03CR) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:25:03] (03CR) 10Federico Ceratto: [C:03+2] db2166: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202775 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [12:25:15] (03PS1) 10Jcrespo: dbbackups: Upgrade db2199, last backup source with 10.6 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1203429 (https://phabricator.wikimedia.org/T394487) [12:27:18] (03PS2) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [12:27:43] I think the bot didn't work, but there is an ongoing #page related to MX queue [12:27:50] (03CR) 10Btullis: [C:03+1] global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:28:04] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:28:11] ah, it worked, just there is a lot of messages here [12:28:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:28:41] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2166 - Upgrading db2166.codfw.wmnet [12:29:00] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2166 - Upgrading db2166.codfw.wmnet [12:29:07] things started bad at 9:30 [12:29:18] (03CR) 10CI reject: [V:04-1] Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis) [12:31:05] here sorry [12:31:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:32:23] acked from the phone [12:32:24] Here as well. Ack'd the page and checking docs if there is anything that can be done. 
[12:33:28] issue is on wikipedia with P [12:33:39] (03CR) 10Brouberol: [C:03+2] global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:33:48] fceratto@cumin1003 major-upgrade (PID 1435157) is awaiting input [12:34:30] I have zero knowledge of the MX queue so I'll need to take a bit more to check the metrics, but jynus is right, something seems started at around 9:30 [12:34:35] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Update location of startupregistrystats script [puppet] - 10https://gerrit.wikimedia.org/r/1202872 (https://phabricator.wikimedia.org/T409212) (owner: 10Zabe) [12:35:11] judging from the queue size though it seems that it got worse two hours afterwards [12:36:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:37:32] (03CR) 10Nikerabbit: [C:03+1] Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [12:37:50] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203434 [12:38:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11358311 (10Raine) >>! In T407094#11357931, @Nahid wrote: > It looks like the dot < **. **> at the end of the public key is missing in the patch. The dot is actually part o... 
[12:38:34] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358313 (10jcrespo) We are working on it, alarm notified us. [12:42:20] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [12:43:18] !log drop database if exists de_labswikimedia; drop database if exists en_labswikimedia; drop database if exists flaggedrevs_labswikimedia; drop database if exists liquidthreads_labswikimedia; drop database if exists readerfeedback_labswikimedia; (T297297) [12:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:21] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [12:44:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:44:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:46:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:46:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:49:20] (03PS1) 10Brouberol: Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444 [12:49:40] !log drop database if exists tokiponawiki; drop database if exists tokiponawikibooks; drop database if exists tokiponawikiquote; drop database if exists tokiponawiktionary; (T297297) [12:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:44] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [12:50:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:53:32] (03PS1) 10Brouberol: mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) [12:55:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:56:46] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2166 gradually with 4 steps - Migration of db2166.codfw.wmnet completed [12:59:37] (03CR) 10Andrew Bogott: [C:03+2] clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 
(https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott) [12:59:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11358376 (10Jclark-ctr) [12:59:57] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T409646#11358378 (10Jclark-ctr) →14Duplicate dup:03T408065 [13:00:14] (03PS3) 10Andrew Bogott: clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) [13:00:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott) [13:00:34] (03PS2) 10Brouberol: mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) [13:02:10] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358382 (10elukey) I am really really ignorant about postfix so please bear with me :) I ran: ` elukey@mx-in1001:~$ f... 
[13:04:03] (03CR) 10Majavah: [C:04-1] toolforge haproxy config: replace httpchk with http-check send (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203175 (owner: 10Andrew Bogott) [13:05:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:05:41] (03CR) 10Andrew Bogott: [C:03+2] clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott) [13:05:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [13:07:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [13:07:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [13:08:12] (03PS3) 10Andrew Bogott: toolforge haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203175 [13:08:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:10:46] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 3 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358432 (10Arnoldokoth) [13:11:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:11:44] !log restart postfix on mx-in2001 to apply an IP ban - T408632 [13:11:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:48] T408632: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632 [13:12:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358441 (10Arnoldokoth) [13:12:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:12:24] !log restart postfix on mx-in1001 to apply an IP ban - T408632 [13:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:57] !log mwscript-k8s --dblist=medium --follow -- purgeUserOptions.php --login-age 15 (T406724) [13:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [13:15:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:15:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:16:24] (03CR) 10Andrew Bogott: [C:03+1] "seems good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [13:16:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [13:16:36] jouncebot: nowandnext [13:16:36] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [13:16:36] In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1400) [13:16:48] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on codfw1dev cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1203399 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [13:17:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [13:17:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358469 (10Jclark-ctr) a:03Jclark-ctr [13:17:42] (03CR) 10Ladsgroup: [C:03+2] Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [13:18:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [13:18:26] (03Merged) 10jenkins-bot: Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [13:19:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]] [13:19:08] T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715 [13:19:26] FIRING: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many 
queued messages: 22380 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [13:23:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [13:27:28] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on remaining codfw1dev nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203400 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [13:27:38] (03PS3) 10KartikMistry: machinetranslation: Increase replica and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) [13:27:38] (03PS1) 10KartikMistry: Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) [13:28:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [13:28:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358526 (10elukey) Judging from the [[ https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers?orgId=1&from=now-... 
[13:28:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:29:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [13:30:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [13:30:44] (03PS2) 10KartikMistry: Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) [13:33:51] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [13:34:43] (03PS3) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) [13:35:07] (03CR) 10Sbisson: [C:03+1] Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry) [13:35:51] (03CR) 10KartikMistry: [C:03+2] Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry) [13:37:05] (03CR) 10Majavah: [C:03+2] P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah) [13:37:31] (03Merged) 10jenkins-bot: Update Recommnedation API to 2025-11-07-162011-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry) [13:41:21] (03PS1) 10Brouberol: airflow: assume the PYTHONPATH env var is defined in the airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203451 (https://phabricator.wikimedia.org/T408711) [13:41:38] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:42:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - Migration of db2166.codfw.wmnet completed [13:42:16] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:43:36] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:43:39] T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715 [13:44:48] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:47:49] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731 (10cmooney) 03NEW p:05Triage→03High [13:50:28] (03PS14) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) [13:50:41] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:51:04] (03PS1) 10Dpogorzelski: ml-serve: tweak aya llm mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) [13:51:07] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:51:27] (03CR) 10Vgutierrez: [V:03+2 C:03+2] varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez) [13:51:29] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2167 - Upgrading db2167.codfw.wmnet [13:51:41] I've merged patch https://gerrit.wikimedia.org/r/1203450 but it isn't available on the deployment server to deploy. What can be the reason? The diff is empty. [13:51:58] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2167 - Upgrading db2167.codfw.wmnet [13:52:42] (03CR) 10Federico Ceratto: [C:03+2] db2167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202776 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [13:52:55] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:53:28] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11358656 (10cmooney) I can take a look at this unless there is another plan? 
[13:55:11] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle) [13:57:06] fceratto@cumin1003 major-upgrade (PID 1518129) is awaiting input [13:57:40] (03PS3) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401 [13:57:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [13:59:59] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]] (duration: 40m 54s) [14:00:03] T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715 [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1400). [14:00:06] edsanders and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:08] o/ [14:00:31] I can self deploy [14:02:26] (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy for both text & upload clusters" [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez) [14:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [14:03:42] (03Merged) 10jenkins-bot: Freeze LiquidThreads on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders) [14:04:01] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]] [14:04:05] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [14:06:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [14:08:08] !log esanders@deploy2002 esanders: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:31] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:09:59] (03CR) 10Btullis: [C:03+1] Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444 (owner: 10Brouberol) [14:10:03] (03CR) 10Elukey: "The change is ok but is aya meant to run with only 8G of memory? 
I'd defer to Aiko for the final +1, since I suspect that we may need more" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [14:10:44] !log esanders@deploy2002 esanders: Continuing with sync [14:10:47] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [14:13:22] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:26] RESOLVED: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1058 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh [14:15:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358737 (10Jclark-ctr) @jcrespo This server is out of warranty. I replaced the disk with one from a decommissioned server; the drive was erased prior to inst... [14:15:56] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:17:49] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]] (duration: 13m 48s) [14:17:53] T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080 [14:20:56] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:23:41] (03CR) 10BBlack: [C:03+1] "LGTM! 
Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez) [14:26:10] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2167 gradually with 4 steps - Migration of db2167.codfw.wmnet completed [14:31:12] !log Update Recommnedation API to 2025-11-07-162011-production (T405000, T406854, T408936, T408937, T408934) [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:25] T405000: Handle failure to load languages from cx server - https://phabricator.wikimedia.org/T405000 [14:31:26] T406854: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854 [14:31:26] T408936: Error when calling the wikidata API with no titles - https://phabricator.wikimedia.org/T408936 [14:31:27] T408937: Faulty error handling when fetching language pairs - https://phabricator.wikimedia.org/T408937 [14:31:28] T408934: Production error: AttributeError: 'NoneType' object has no attribute 'keys' - https://phabricator.wikimedia.org/T408934 [14:31:36] (03PS1) 10Esanders: Create maintenance script to apply manual fixes [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) [14:31:53] (03Abandoned) 10CDanis: discovery.wmnet: add gerrit alias [dns] - 10https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) (owner: 10CDanis) [14:32:00] (03CR) 10Tchanders: [C:03+1] Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: 10STran) [14:32:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358831 (10elukey) Looks like we are back in acceptable ranges again! Please let me know if anything is missing. 
[14:32:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders) [14:32:38] Hi sorry I've some problem with IRC... Are you deploying rn? [14:33:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders) [14:33:30] Superpes52: hey I just started my second patch [14:34:11] Superpes52: shall I do your config change after, or can you deploy by yourself? [14:34:42] edsanders Ah so I'm on time! I'm not a deployer :) [14:34:57] ok, no problem [14:37:23] (03CR) 10Federico Ceratto: [C:03+1] "I also suggest using the default permission (0755) for consistency and treating the whole data directory and Prometheus as Grafana instanc" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [14:41:59] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11358897 (10fgiunchedi) Yes please @cmooney, much appreciated! Note that this is currently not a blocker / not high... 
[14:42:44] (03Merged) 10jenkins-bot: Create maintenance script to apply manual fixes [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders) [14:43:06] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]] [14:43:10] T397426: Implement bulk fixes on ptwikibooks - https://phabricator.wikimedia.org/T397426 [14:45:05] !log esanders@deploy2002 esanders: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:45:30] !log esanders@deploy2002 esanders: Continuing with sync [14:45:51] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [14:46:33] (03PS1) 10Bking: opensearch-cluster: raise defaults to match design doc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501) [14:48:19] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 31.39 ms [14:49:09] 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735 (10cmooney) 03NEW p:05Triage→03Medium [14:50:27] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]] (duration: 07m 21s) [14:50:32] T397426: Implement bulk fixes on ptwikibooks - https://phabricator.wikimedia.org/T397426 [14:50:39] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [14:50:48] Superpes52: shall I start your config change? 
[14:50:59] Yep thanks edsanders [14:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203238 (https://phabricator.wikimedia.org/T409578) (owner: 10Superpes15) [14:52:05] (03Merged) 10jenkins-bot: [ptwiki] Add new abusefilter usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203238 (https://phabricator.wikimedia.org/T409578) (owner: 10Superpes15) [14:52:22] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]] [14:52:26] T409578: Create new user group on ptwiki "Administrador do filtro de abusos" - https://phabricator.wikimedia.org/T409578 [14:53:35] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [14:54:53] !log esanders@deploy2002 superpes, esanders: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:55:04] Testing [14:55:37] Looks fine thanks edsanders :) [14:55:43] !log esanders@deploy2002 superpes, esanders: Continuing with sync [15:00:00] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]] (duration: 07m 37s) [15:00:04] T409578: Create new user group on ptwiki "Administrador do filtro de abusos" - https://phabricator.wikimedia.org/T409578 [15:00:21] Many thanks for your assistance edsanders :3 [15:01:53] no problem [15:05:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11359057 (10jcrespo) I will keep an eye on it until it gets rebuilt, thanks for the quick help. I will also have a look at the warnings. [15:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:32] (03PS1) 10CDanis: base: add bat (batcat) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1203462 [15:11:39] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2167 gradually with 4 steps - Migration of db2167.codfw.wmnet completed [15:11:40] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:15:31] (03CR) 10Dpogorzelski: "i just need to be able to iterate and make tiny but consistent progress. 
these are mostly dev changes so i think i'll just go ahead and me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [15:18:52] andrew@cumin2002 reimage (PID 1974214) is awaiting input [15:25:16] !log drop database if exists webshop (T297297) [15:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:20] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [15:26:26] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol) [15:26:41] (03CR) 10Brouberol: [C:03+2] Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444 (owner: 10Brouberol) [15:26:58] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: tweak aya llm mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [15:27:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11359162 (10jcrespo) I saw the warnings, but I see no problem on the logs, other than it detecting your disk change and firmware update. Once the disk rebuild... 
[15:29:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:29:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1530) [15:32:11] (03CR) 10Vgutierrez: [V:03+2 C:03+2] varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez) [15:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:18] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [15:37:40] jouncebot: nowandnext [15:37:40] For the next 0 hour(s) and 22 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1530) [15:37:40] In 0 hour(s) and 52 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1630) [15:40:26] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11359233 (10LSobanski) p:05Triage→03Medium [15:41:00] (03PS1) 10Dpogorzelski: ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) [15:41:33] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11359238 (10LSobanski) 05Open→03Resolved Resolving, please reopen if you still think this is a problem. 
[15:42:56] (03CR) 10Elukey: "Post-merge comment: I totally understand you point but we have had problems in the past with models taking some extra memory when loading " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski) [15:42:58] (03CR) 10Dpogorzelski: [C:03+2] ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [15:46:31] 06SRE, 06Infrastructure-Foundations, 10netops: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823#11359271 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as part of backlog review. There have been changes to the network and Puppet since the creation of this ta... [15:46:44] (03PS1) 10Majavah: P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741) [15:46:45] (03PS2) 10Dpogorzelski: ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) [15:47:09] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [15:47:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[15:50:16] (03CR) 10Scott French: [C:03+2] hieradata: pilot cfssl/pki for etcd on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [15:51:13] I'm going to merge a change shortly that will require a short disruption on a single conftool etcd node in codfw [15:51:47] although it's unlikely this will cause problems, it's preferable if no mediawiki deployment is ongoing concurrently [15:52:26] as such, I am going to briefly take the scap lock while the change is happening [15:52:38] thanks for deploying edsanders <3 [15:52:42] (I was busy earlier) [15:52:50] (03CR) 10David Caro: [C:03+1] P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741) (owner: 10Majavah) [15:53:28] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 [15:53:32] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:56:48] (03CR) 10Majavah: [C:03+2] P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741) (owner: 10Majavah) [16:05:06] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 (duration: 11m 38s) [16:05:27] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:05:31] the dust has settled and I've released the lock. 
thanks all [16:08:35] (03CR) 10Federico Ceratto: [C:03+2] db2181: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202777 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:10:11] !log begin rolling restart of codfw-associated confds after conf2006 etcd restart - T352245 [16:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:11:17] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2181 - Upgrading db2181.codfw.wmnet [16:11:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2181 - Upgrading db2181.codfw.wmnet [16:13:06] (03PS1) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) [16:14:04] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2199.codfw.wmnet with reason: MariaDB upgrade [16:14:35] fceratto@cumin1003 major-upgrade (PID 1656166) is awaiting input [16:16:28] (03PS1) 10Brouberol: growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415) [16:17:13] (03CR) 10Muehlenhoff: "It's more, it's also the official tool to self-manage your developer account (change email, change Cloud SSH keys e.g.). 
In addition it ha" [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [16:18:29] (03CR) 10CI reject: [V:04-1] sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) (owner: 10Muehlenhoff) [16:18:47] (03PS1) 10Brouberol: Define the synthetic data PG data source in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203487 (https://phabricator.wikimedia.org/T409591) [16:21:12] (03PS2) 10Brouberol: growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415) [16:21:28] (03CR) 10Brouberol: [C:03+2] Define the synthetic data PG data source in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203487 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol) [16:23:36] (03PS2) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) [16:23:38] (03CR) 10Brouberol: [C:03+2] growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [16:23:45] (03PS1) 10CDanis: admin: deployment: add volker-e & new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) [16:25:21] (03CR) 10CDanis: "Please confirm whether or not you still want the old ssh-rsa key kept active as well, and then we'll get the access updated too. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) (owner: 10CDanis) [16:25:25] fceratto@cumin1003 major-upgrade (PID 1656166) is awaiting input [16:27:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11359455 (10Jclark-ctr) I did finally get confirmation on tracking on replacement memory It should be onsite by end of day tomorrow Unless Delayed by holiday. Can i repla... [16:27:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11359467 (10Jclark-ctr) a:05Marostegui→03Jclark-ctr [16:28:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:28:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:29:36] (03CR) 10CI reject: [V:04-1] sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) (owner: 10Muehlenhoff) [16:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1630). 
[16:30:59] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin1002 - T407110 [16:31:35] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin1002 - T407110 [16:31:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11359494 (10Jclark-ctr) [16:31:38] (03PS4) 10Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 [16:31:47] (03CR) 10CDanis: [C:03+2] admin: Update brett SSH key to FIDO [puppet] - 10https://gerrit.wikimedia.org/r/1203179 (https://phabricator.wikimedia.org/T409600) (owner: 10BCornwall) [16:31:55] (03PS1) 10Mmartorana: Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996) [16:32:06] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110 [16:34:02] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110 [16:34:05] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110 [16:41:57] FIRING: [2x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1026:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:25] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:08] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11359587 (10LSobanski) p:05Triage→03Low [16:47:25] FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:48:34] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110 [16:51:40] (03PS1) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) [16:52:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:52:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:56:17] RECOVERY - MegaRAID on db1171 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:56:44] (03CR) 10JMeybohm: [C:03+1] ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [16:57:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:57:25] FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:08] (03CR) 10CI reject: [V:04-1] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [16:58:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:58:50] (03CR) 10SBassett: [C:03+2] Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996) 
(owner: 10Mmartorana) [16:59:50] (03PS3) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) [17:01:15] (03Merged) 10jenkins-bot: Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996) (owner: 10Mmartorana) [17:02:02] FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:02:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:02:25] RESOLVED: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:02:46] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11359722 (10CDanis) Hi @Chandra-WMDE , seems like you posted the private key in the task instead of the public. Please stop using that key for anything, and generate a new one,... 
[17:06:57] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:07] FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:55] (03PS1) 10CDanis: admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) [17:11:57] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:08] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11359791 (10Jdrewniak) hi @Dzahn, I just confirmed with @cmadeo that the desired domain/path for this microsite is actually: https://www.wikipedia.org/25-years-o... [17:13:11] (03CR) 10JMeybohm: "Looks good, thanks! 
I would suggest to add the include to a couple of more helmfile files in order to make sure the CI change does not sta" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [17:13:22] (03CR) 10JMeybohm: "I don't think removing from general-* files will work as of now since admin_ng helmfiles ingest the value from there." [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [17:16:09] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2181 gradually with 4 steps - Migration of db2181.codfw.wmnet completed [17:16:57] FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:21:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH key for Brett Cornwall - https://phabricator.wikimedia.org/T409600#11359832 (10CDanis) 05Open→03Resolved merged and fast-deployed to `A:bastion OR A:cumin` [17:21:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis) [17:21:57] FIRING: [8x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:22:42] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade db2199, last backup source with 10.6 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1203429 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [17:26:57] (03PS1) 10Muehlenhoff: Remove leftover import of python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860) [17:26:57] FIRING: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis) [17:31:57] FIRING: [10x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:30] (03CR) 10CI reject: [V:04-1] Remove leftover import of python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [17:39:45] (03Abandoned) 10Muehlenhoff: Remove leftover import of 
python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [17:48:38] (03CR) 10BPirkle: [C:03+1] Change RESTbase => REST in wgRestSandboxSpecs names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (owner: 10Aaron Schulz) [17:49:26] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:23] (03PS1) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) [17:54:24] phabricator.wikimedia.org seems down? Getting: Request served via cp3070 cp3070, Varnish XID 37697385 [17:54:25] Upstream caches: cp3070 int [17:54:25] Error: 403, 02cd48e281926cca9 (0930e9c) at Mon, 10 Nov 2025 17:53:59 GMT [17:54:26] RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:00:04] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1800). nyaa~ [18:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1800). 
[18:01:16] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:01:38] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2181 gradually with 4 steps - Migration of db2181.codfw.wmnet completed [18:01:39] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [18:01:49] !log mmartorana@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:02:16] !log mmartorana@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:02:49] !log mmartorana@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:03:07] !log mmartorana@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:03:10] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:03:34] !log mmartorana@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:03:54] !log mmartorana@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:04:05] !log mmartorana@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:04:14] !log mmartorana@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:04:26] !log mmartorana@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:04:31] !log mmartorana@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:04:43] (03PS1) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) [18:05:13] (03CR) 10CI reject: [V:04-1] containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 
(https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [18:05:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:05:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:05:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:06:06] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:07:38] (03PS2) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) [18:09:38] (03PS3) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) [18:09:59] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on people1004.eqiad.wmnet with reason: decom [18:10:21] (03PS4) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) [18:10:29] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:10:34] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts people1004.eqiad.wmnet [18:10:38] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [18:10:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:11:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:11:20] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:11:46] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:11:57] RESOLVED: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:11:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:12:07] FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:12:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:12:30] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:12:44] !log [WDQS] Restarted wdqs-main in codfw [18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2090.codfw.wmnet with OS bullseye [18:14:06] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:07] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360109 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2090.codfw.wmnet with OS bullseye [18:14:07] PROBLEM - MariaDB Replica SQL: s4 on db2199 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:14:09] PROBLEM - MariaDB read only s4 on db2199 is CRITICAL: Could not connect to localhost:3314 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [18:14:13] PROBLEM - mysqld processes on db2199 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:14:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2091.codfw.wmnet with OS bullseye [18:14:35] PROBLEM - MariaDB Replica IO: s4 on db2199 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:14:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360110 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2091.codfw.wmnet with OS bullseye [18:15:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2093.codfw.wmnet with OS bullseye [18:15:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360111 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2093.codfw.wmnet with OS bullseye [18:15:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2094.codfw.wmnet with OS bullseye [18:15:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye [18:15:43] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:15:50] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:15:51] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [18:16:02] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:16:16] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:17:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage [18:17:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage [18:18:25] (03CR) 10Scott French: [C:03+2] deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:18:36] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [18:18:58] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [18:20:11] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:20:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:20:36] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:20:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:21:25] dzahn@cumin2002 decommission (PID 2061752) is awaiting input [18:22:02] FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:22:50] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:22:58] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [18:23:02] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:23:14] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:23:31] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:24:01] PROBLEM - Host db2199 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage [18:25:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:25:01] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:01] RECOVERY - Host db2199 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [18:25:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1004.eqiad.wmnet [18:25:09] PROBLEM - MariaDB Replica SQL: s4 on db2199 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:25:11] PROBLEM - MariaDB read only s4 on db2199 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [18:25:16] ^downtime expired [18:25:17] PROBLEM - mysqld processes on db2199 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:25:18] fixing [18:25:35] PROBLEM - MariaDB Replica IO: s4 on db2199 is CRITICAL: CRITICAL slave_io_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:26:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11360194 (10Dzahn) a:05Dzahn→03SKaram-WMF [18:26:17] RECOVERY - mysqld processes on db2199 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:26:35] RECOVERY - MariaDB Replica IO: s4 on db2199 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:27:02] FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:27:07] RECOVERY - MariaDB Replica SQL: s4 on db2199 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:27:11] RECOVERY - MariaDB read only s4 on db2199 is OK: Version 10.11.14-MariaDB-log, Uptime 64s, read_only: True, event_scheduler: True, 3987.95 QPS, connection latency: 0.028184s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [18:27:52] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [18:28:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage [18:29:47] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch 
mw-(cron|videoscaler) to PHP 8.3 - T405955 [18:29:51] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:30:04] !log swfrench@deploy2002 Stopping before sync operations [18:32:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:32:37] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [18:33:07] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [18:33:53] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11360252 (10Dzahn) The first space character separates the key from the comment field. It should work with or without the comment field though. To debug I recommend first v... 
[18:34:20] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [18:34:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [18:35:22] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts people2003.codfw.wmnet [18:39:21] (03PS1) 10Dzahn: site: remove decom'ed people bookworm machines [puppet] - 10https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) [18:40:19] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:42:02] FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:42:38] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [18:43:08] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [18:44:01] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:44:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:44:29] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people2003.codfw.wmnet [18:45:12] !log destroyed former people.wikimedia.org backends people1004/people2003 - replaced by trixie VMs people1005/people2004 [18:45:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:52] (03CR) 10Dzahn: [C:03+2] site: remove decom'ed people bookworm machines [puppet] - 10https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) (owner: 10Dzahn) [18:46:59] (03PS2) 10Dzahn: site: remove decom'ed people bookworm machines [puppet] - 10https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) [18:47:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:49:16] (03CR) 10Dzahn: [C:03+2] site: remove decom'ed people bookworm machines [puppet] - 10https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) (owner: 10Dzahn) [18:50:04] (03PS4) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [18:50:48] (03CR) 10Kamila Součková: "Adding to a couple more helmfiles done, let's see what CI thinks :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [18:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:51:26] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:52:11] (03CR) 10Aaron Schulz: Change RESTbase => REST in wgRestSandboxSpecs names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (owner: 10Aaron Schulz) [18:54:18] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage [18:56:26] RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:58:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage [19:01:42] 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11360362 (10ssingh) Thanks for filing this task @cmooney! The geofeed link above is very helpful. So it seems from the above (57.141.8.0/24, 57.141.8.0/24), we are missing the entries in the geo-maps... 
[19:02:02] RESOLVED: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:04:01] (03CR) 10CI reject: [V:04-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [19:11:24] andrew@cumin2002 reimage (PID 2064112) is awaiting input [19:11:57] (03PS4) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [19:16:44] (03PS5) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [19:19:34] (03PS3) 10Arlolra: Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) [19:20:00] (03CR) 10Arlolra: Deploy Parsoid Read Views to 13 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: 10Arlolra) [19:20:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: 10Arlolra) [19:25:57] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11360447 (10Dzahn) Hello @Jdrewniak Do you really mean 
wikiPedia.org or wikiMedia.org? Just wanted to double check first because the config you link to is actually... [19:32:59] (03PS2) 10Daniel Kinzler: rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) [19:35:47] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2094.codfw.wmnet with OS bullseye [19:35:55] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360494 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye execute... [19:36:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11360495 (10Geagea) Now again VRT number - 17 digits 20251110103208628 20251110103208173 [19:44:40] (03PS2) 10CDanis: admin: deployment: add volker-e & rotate his ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) [19:46:24] (03CR) 10CDanis: [C:03+2] "Confirmed via Slack DM" [puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) (owner: 10CDanis) [19:46:39] (03CR) 10Ssingh: [C:03+1] base: add bat (batcat) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1203462 (owner: 10CDanis) [19:46:53] (03CR) 10CDanis: [C:03+2] base: add bat (batcat) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1203462 (owner: 10CDanis) [19:47:56] 06SRE, 10LDAP-Access-Requests, 06Research, 10Research-collaborations: Hourly pageview data request — Splitsville (2025) and related indie-film Wikipedia pages - https://phabricator.wikimedia.org/T409639#11360548 (10A_smart_kitten) →14Duplicate dup:03T409676 [19:54:15] 
!log removing 2 files for legal compliance [19:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11360557 (10CDanis) 05Open→03Resolved a:03CDanis [19:58:25] (03CR) 10Muehlenhoff: "Unless it's also available on Buster this would break Puppet on the puppetmaster* nodes, though?" [puppet] - 10https://gerrit.wikimedia.org/r/1203462 (owner: 10CDanis) [19:59:03] !log removing 1 file for legal compliance [19:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:01] (03PS1) 10CDanis: base: no batcat in <=buster [puppet] - 10https://gerrit.wikimedia.org/r/1203512 [20:04:03] (03CR) 10Ssingh: [C:03+1] base: no batcat in <=buster [puppet] - 10https://gerrit.wikimedia.org/r/1203512 (owner: 10CDanis) [20:05:11] (03CR) 10CDanis: [C:03+2] base: no batcat in <=buster [puppet] - 10https://gerrit.wikimedia.org/r/1203512 (owner: 10CDanis) [20:09:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: 10Pppery) [20:21:23] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [20:25:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [20:27:11] (03PS5) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [20:37:11] (03PS1) 10Mstyles: OATHAuth: Increase 2FA opt-in to 60% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) 
[20:38:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [20:38:27] (03CR) 10CI reject: [V:04-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [20:38:38] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11360708 (10Andrew) On @fgiunchedi's request I tried dd'ing every drive on a server before reimaging but grub still exhibits the issue. [20:46:20] (03PS2) 10Mstyles: OATHAuth: Increase 2FA opt-in to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) [20:50:09] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [20:57:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 (owner: 10Jdlrobson) [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T2100). [21:00:04] RoanKattouw, toyofuku, aude, arlolra, Pppery, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:01:17] I'll go last :) [21:01:44] with spiderpig, how do the backports work now? [21:02:00] does a deploy do all the patches still? [21:02:03] deployer [21:02:27] You can use Spiderpig yourself if you have access [21:02:41] everyone does their own patch? [21:02:43] Otherwise the deployer will do it for you... and they'll just use Spiderpig themselves anyway [21:02:56] If they can, usually yes (not everyone has Spiderpig access) [21:02:59] ok, deploying my patch [21:03:12] Among other things, I'm here and I don't have spiderpig access [21:03:50] (03PS1) 10Kosta Harlan: hCaptcha: Use FancyCaptcha for API edits and page creations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) [21:03:59] Pppery: I'll do yours after aude is done [21:05:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy2002 using scap backport" [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) (owner: 10Stoyofuku-wmf) [21:07:16] I don't think it's possible to test mine without actually setting up Tor - do people want me to try to do that or are they willing to deploy without testing [21:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:07] I'm happy to deploy that without fully testing it [21:09:16] OK [21:09:40] I'm scheduling this for deployment on behalf of the community, not because I have a personal stake in it [21:09:40] I would suggest getting someone who does use Tor to test it later though, to verify that the deploy did what you expected it to do [21:09:58] Will do [21:10:24] But when Spiderpig asks me to test the change before it continues the deployment, I'm just going to check that the site still works and 
then hit continue [21:12:04] (03PS1) 10CDanis: A modest proposal: run oomd on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203548 [21:12:14] is jenkins normally this slow? [21:12:48] (03Merged) 10jenkins-bot: Use addModuleStyles for ReadingList icons [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) (owner: 10Stoyofuku-wmf) [21:13:06] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]] [21:13:10] T409116: Move ReadingList/Collections icon up in the loading module sequence - https://phabricator.wikimedia.org/T409116 [21:14:55] (03PS3) 10Jdlrobson: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) [21:15:03] (03CR) 10Jdlrobson: [C:03+1] Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: 10Jdlrobson) [21:15:10] That was only 7 minutes, that's not that slow for an extension change. For config changes it's much faster, but for gated extensions it's slower [21:15:16] !log aude@deploy2002 toyofuku, aude: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:16:57] !log aude@deploy2002 toyofuku, aude: Continuing with sync [21:17:55] andrew@cumin2002 reimage (PID 2110470) is awaiting input [21:21:22] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]] (duration: 08m 16s) [21:21:26] T409116: Move ReadingList/Collections icon up in the loading module sequence - https://phabricator.wikimedia.org/T409116 [21:21:29] i'm done [21:22:54] (03PS1) 10Scott French: Minor usability improvements for known-client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1203550 [21:25:16] (03CR) 10Scott French: [V:03+2] "Tested locally at `17556f9`" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1203550 (owner: 10Scott French) [21:25:33] (03CR) 10Scott French: [V:03+2 C:03+2] Minor usability improvements for known-client objects [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1203550 (owner: 10Scott French) [21:25:49] who is next? [21:25:59] I think I am [21:26:05] Yes I'll do your patch now [21:26:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: 10Pppery) [21:27:01] arlolra: After that, would you like to deploy your own patch, or would you like me to do it for you? 
[21:27:07] (03Merged) 10jenkins-bot: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: 10Pppery) [21:27:13] I can take care of it, thanks [21:27:16] Great thanks [21:27:17] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002" [21:27:19] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002 [21:27:26] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]] [21:27:28] You can go after this one, and then we have Jon's patch, and then my patches [21:27:29] T409022: Remove extended autoconfirmed time for tor users on enwiki - https://phabricator.wikimedia.org/T409022 [21:28:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002 [21:28:11] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002" [21:28:23] Ok [21:29:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11360925 (10Jclark-ctr) 05Open→03Resolved Idrac is showing SYSTEM IS HEALTHY after rebuilding. 
[21:30:08] !log catrope@deploy2002 catrope, pppery: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:31:10] !log catrope@deploy2002 catrope, pppery: Continuing with sync [21:32:13] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 457714272 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:32:32] I actually decided to fully test this anyway. I can confirm it works [21:32:42] Great, thank you! [21:33:35] Getting Tor running was much smoother than I thought it would be, and I could exploit a bug/misfeature in TorBlock where it applies the enhanced autoconfirmed standards to every user as seen in their UserRights page, not just you, to avoid having to set up a test account [21:35:45] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]] (duration: 08m 19s) [21:35:49] T409022: Remove extended autoconfirmed time for tor users on enwiki - https://phabricator.wikimedia.org/T409022 [21:36:06] arlolra: Your turn [21:36:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: 10Arlolra) [21:36:29] Wow deployments have gotten a lot faster lately! 
(cc swfrench-wmf ) [21:36:59] :) [21:37:13] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 16552 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:37:16] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: 10Arlolra) [21:37:34] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]] [21:37:38] T409593: Parsoid Read Views to deploy ~2025-11-10 - https://phabricator.wikimedia.org/T409593 [21:39:31] we've done a bit of tuning to make the prod deployment step a bit faster despite some of the awkwardness around the ongoing PHP migration. [21:39:32] that said, a deployment that incurs a full image build (e.g., due to l10n updates), will still be rather slow [21:39:44] (03PS1) 10BryanDavis: wikitech: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) [21:39:57] Yeah I'll get to experience that in a little bit, I have an i18n change that I'm backporting (at the end of this window so as to not inconvenience others) [21:40:16] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:40:20] good call scheduling that at the end! [21:41:31] !log arlolra@deploy2002 arlolra: Continuing with sync [21:43:28] wikitech doesn't have a history of on-wiki discussion for config changes, so I jumped right to a phab task (T409785) and gerrit patch for enabling the new protection indicators from core. Comment on either if you have an argument against turning this on. 
[21:43:29] T409785: Enable protection indicators for wikitech - https://phabricator.wikimedia.org/T409785 [21:44:34] @RoanKattouw are you using the security window after or is it okay if this deploy window goes over a little? [21:44:42] my config changes should be relatively quick and can go out together [21:44:44] (03PS2) 10BryanDavis: wikitech: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) [21:44:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11361008 (10Jclark-ctr) replaced failed drive bay 4. idrac also now has allert for A predictive failure detected on drive 0 in disk... [21:45:23] Jdlrobson: I'll do your config changes first and then my time-consuming i18n change [21:45:35] That way I'm only inconveniencing myself with the security window [21:45:48] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]] (duration: 08m 14s) [21:45:52] T409593: Parsoid Read Views to deploy ~2025-11-10 - https://phabricator.wikimedia.org/T409593 [21:46:01] RoanKattouw: back to you [21:46:26] Jdlrobson: You said changes plural? Is there more than just https://gerrit.wikimedia.org/r/c/1199482/ ? [21:47:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 (owner: 10Jdlrobson) [21:48:28] (03Merged) 10jenkins-bot: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 (owner: 10Jdlrobson) [21:48:46] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1199482|Update QuickSurvey platforms]] [21:49:16] (03CR) 10BryanDavis: "I announced this in a couple of irc channels in case someone has a reason to oppose it. 
I kind of think we can be WP:BOLD and deploy whene" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: 10BryanDavis) [21:50:42] RoanKattouw: yeh i'd like to land https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1200173?usp=search as well if I can [21:51:02] (unused config code) [21:51:05] !log catrope@deploy2002 catrope, jdlrobson: Backport for [[gerrit:1199482|Update QuickSurvey platforms]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:51:10] OK I'll do that one next [21:51:19] (03CR) 10Lucas Werkmeister: [C:03+1] wikitech: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: 10BryanDavis) [21:51:20] sorry i missed you +2ed my change already [21:51:24] Jdlrobson: Could you test your QuickSurveys patch? [21:51:29] yep on it now [21:53:24] lgtm RoanKattouw [21:53:32] !log catrope@deploy2002 catrope, jdlrobson: Continuing with sync [21:57:48] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199482|Update QuickSurvey platforms]] (duration: 09m 02s) [21:59:57] (03PS2) 10BryanDavis: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201816 [22:00:05] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T2200). 
[22:00:05] (03CR) 10CI reject: [V:04-1] wikitech: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201816 (owner: 10BryanDavis) [22:03:58] I have three security patches to deploy [22:04:14] Go ahead, I'll finish the rest of the backports after you're done [22:05:32] maryum: just flagging the comment at https://phabricator.wikimedia.org/T407157#11361165 made a few mins ago [22:05:33] maryum: please see T407157#11361165 in case you're planning to deploy that [22:05:35] oh [22:05:35] (03PS3) 10BryanDavis: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201816 [22:05:40] lol, great minds think alike [22:06:28] I was planning to deploy that SomeRandomDev A_smart_kitten [22:07:01] does that mean that patch can't go out since the core MR is still open? [22:07:09] yes [22:07:45] okay I'll check back Thursday which is the next window [22:07:53] alright, thanks [22:08:22] appreciate the heads up [22:12:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11361187 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [22:14:06] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:33] SomeRandomDev when I try to apply your alternative3 patch for T406664, it's not working. 
I'll leave a comment there [22:17:53] it's not mine, but I can take a look [22:18:27] thanks [22:18:38] yep I just realized you commented on it but didn't write it [22:21:40] (03PS3) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [22:30:30] (03PS4) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [22:32:59] preparing to run scap [22:34:06] (03PS1) 10Scott French: deployment_server: fully migrate mw-(api-ext|web) to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1203559 (https://phabricator.wikimedia.org/T405955) [22:36:57] scap is running [22:46:05] scap is finished [22:46:10] !log Deployed fix for T406664 and T401053 [22:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:18] maryum: Are you all done? 
[22:50:24] yes [22:50:31] Great, then I'll jump back in [22:50:37] enjoy [22:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) (owner: 10Catrope) [22:55:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203126 (owner: 10Catrope) [22:55:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [22:56:11] RoanKattouw: are you still able to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1200173 ? [22:56:16] (03Merged) 10jenkins-bot: OATHAuth: Increase 2FA opt-in to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [22:56:17] or can i do that quickly? [22:56:20] Yes I'll do that next [22:56:23] thx! 
[22:56:55] (03CR) 10Btullis: [C:03+2] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [23:08:05] (03Merged) 10jenkins-bot: i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) (owner: 10Catrope) [23:08:06] (03Merged) 10jenkins-bot: OATHManage: Don't always set the page title to "Create new recovery codes" [extensions/OATHAuth] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203126 (owner: 10Catrope) [23:08:28] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] [23:08:33] T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749 [23:08:34] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [23:10:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11361376 (10BTullis) 05Open→03Resolved It's all done now. Apologies for the delay in getting to this. [23:16:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11361406 (10BTullis) >>! In T408065#11361008, @Jclark-ctr wrote: > replaced failed drive bay 4. idrac also now has allert for A pred... 
[23:17:07] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1203 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [23:34:11] !log catrope@deploy2002 catrope, mstyles: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:34:16] T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749 [23:34:16] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [23:37:12] !log catrope@deploy2002 catrope, mstyles: Continuing with sync [23:39:11] FIRING: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [23:39:46] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [23:39:50] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [23:55:52] My deploy just failed after running for an hour :( 
https://spiderpig.wikimedia.org/jobs/887 [23:56:07] `context deadline exceeded` well then [23:56:12] RoanKattouw: that is very odd ... it looks like _only_ mw-wikifunctions timed out? [23:56:20] and that triggered everything to roll back =/ [23:56:26] Yeah it rolled back everything [23:56:32] I'll take a quick look [23:56:44] the good news is that retrying will be _much_ faster [23:56:48] Great [23:56:55] Would you like me to kick off that retry now? [23:57:03] Or would you like some time to take a look first? [23:57:07] if it would be alright, give me a sec to see if I can sort out what happened [23:57:59] OK take your time