[00:10:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:15:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:37:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283
[00:37:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283 (owner: TrainBranchBot)
[00:47:57] <wikibugs>	 (PS4) Scott French: hieradata: pilot cfssl/pki for etcd on conf2006 [puppet] - https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245)
[00:48:15] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[00:51:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203283 (owner: TrainBranchBot)
[00:54:21] <wikibugs>	 (PS1) Scott French: deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955)
[00:54:23] <wikibugs>	 (PS1) Scott French: mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955)
[01:00:41] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:07:19] <wikibugs>	 (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389)
[01:07:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287
[01:07:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287 (owner: TrainBranchBot)
[01:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:14:58] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 17s)
[01:29:47] <wikibugs>	 (CR) STran: [C:+2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389) (owner: STran)
[01:30:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203287 (owner: TrainBranchBot)
[01:31:51] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1203286 (https://phabricator.wikimedia.org/T402389) (owner: STran)
[01:33:22] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:49] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[01:38:22] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:52] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[01:41:43] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[01:42:12] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[01:42:15] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[01:42:40] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[01:52:55] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:18:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:52:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:57:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:58:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:26:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:27:06] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:37:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:01:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:04:18] <wikibugs>	 SRE, Gerrit, observability, Release-Engineering-Team (Radar): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086#11357128 (Pppery)
[05:08:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:13:30] <wikibugs>	 SRE, Traffic-Icebox, Patch-Needs-Improvement: Preserve Server response header when generating custom error page via VCL - https://phabricator.wikimedia.org/T285926#11357159 (Pppery)
[05:33:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:06] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:49:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:52:55] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:03:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[06:18:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[06:31:46] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[06:32:24] <wikibugs>	 (CR) CI reject: [V:-1] cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:34:47] <wikibugs>	 (PS3) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[06:39:43] <wikibugs>	 (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[06:51:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:27:06] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:27:30] <wikibugs>	 (PS1) KartikMistry: apertium: staging: Update to 2025-11-10-034557-production [deployment-charts] - https://gerrit.wikimedia.org/r/1203296 (https://phabricator.wikimedia.org/T408515)
[07:31:02] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498)
[07:31:07] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498)
[07:32:02] <wikibugs>	 (PS4) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[07:32:02] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555)
[07:32:58] <wikibugs>	 (CR) CI reject: [V:-1] cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[07:33:25] <wikibugs>	 (CR) CI reject: [V:-1] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[07:33:25] <wikibugs>	 (CR) CI reject: [V:-1] rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[07:37:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:58:33] <wikibugs>	 (PS1) Muehlenhoff: Remove access for stevemunene [puppet] - https://gerrit.wikimedia.org/r/1203299
[07:59:57] <wikibugs>	 (CR) Vgutierrez: [C:+1] cache/haproxy: set x-trusted-request to D for UA-compliant robots [puppet] - https://gerrit.wikimedia.org/r/1203054 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:03:53] <wikibugs>	 (PS1) Slyngshede: IDP: Switch to CAS 7.2 [dns] - https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455)
[08:04:01] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for stevemunene [puppet] - https://gerrit.wikimedia.org/r/1203299 (owner: Muehlenhoff)
[08:07:43] <logmsgbot>	 !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Stevemunene out of all services on: 2395 hosts
[08:19:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:19:52] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:21:28] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:24:40] <wikibugs>	 (PS1) Muehlenhoff: Remove Steve from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1203375
[08:29:27] <wikibugs>	 (CR) Vgutierrez: [C:-1] cache::text: introduce rate-limits by traffic class (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[08:42:50] <wikibugs>	 (PS1) Vgutierrez: external_clouds_vendors: Add CCBot [puppet] - https://gerrit.wikimedia.org/r/1203377
[08:44:25] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11357368 (Krd) The bounces queue is at 292k now, and increasing. Please have a look.
[08:45:12] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11357370 (Krd) {F70070832}
[08:45:13] <wikibugs>	 (PS1) Brouberol: airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711)
[08:45:15] <wikibugs>	 (PS1) Brouberol: airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711)
[08:47:49] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1203375 (owner: Muehlenhoff)
[08:48:02] <moritzm>	 !log uploaded openjdk-8 8u472-ga-1~deb12u1 to apt.wikimedia.org (forward port of latest Java 8 security updates)
[08:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:11] <moritzm>	 !log installing Java 8 security updates on Bookworm
[08:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:18] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:52:33] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11357385 (MoritzMuehlenhoff)
[08:52:35] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1203377 (owner: Vgutierrez)
[08:52:51] <wikibugs>	 (CR) Vgutierrez: [C:+2] external_clouds_vendors: Add CCBot [puppet] - https://gerrit.wikimedia.org/r/1203377 (owner: Vgutierrez)
[08:53:51] <wikibugs>	 (CR) David Caro: [C:+1] hieradata: Remove obsolete haproxy_exporter settings [puppet] - https://gerrit.wikimedia.org/r/1202997 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[08:54:20] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Remove obsolete haproxy_exporter settings [puppet] - https://gerrit.wikimedia.org/r/1202997 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[08:55:17] <wikibugs>	 (CR) Clément Goubert: [C:+2] Route transform/wikitext/to/lint(.*) to the gateway on group1 [puppet] - https://gerrit.wikimedia.org/r/1194995 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz)
[09:00:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove Steve from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1203375 (owner: Muehlenhoff)
[09:00:39] <wikibugs>	 (CR) Clément Goubert: [C:+2] Route /page/lint(.*) to the gateway on group0 [puppet] - https://gerrit.wikimedia.org/r/1199033 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[09:01:26] <claime>	 moritzm: ok to merge?
[09:03:48] <wikibugs>	 (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:05:07] <wikibugs>	 (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:05:50] <claime>	 moritzm: Assuming I can merge since it's just a user right change
[09:06:07] <moritzm>	 claime: yes, please. I had been attempting to merge, but the lock was held by you
[09:06:09] <wikibugs>	 (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:06:23] <claime>	 moritzm: hah! Merged :D
[09:06:27] <moritzm>	 thanks!
[09:08:48] <wikibugs>	 (PS4) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:10:55] <wikibugs>	 (PS5) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:12:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690 (fgiunchedi) NEW
[09:13:15] <wikibugs>	 (CR) Brouberol: [C:+2] dse-k8s-eqiad: enable ceph-csi-cephfs in the growthbook namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:13:23] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:13:25] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:13:52] <wikibugs>	 (CR) Gehel: [C:-1] "minor comment on documentation" [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:15:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11357477 (fgiunchedi) >>! In T399180#11310972, @cmooney wrote: >>>! In T399180#11310845, @fgiunchedi wrote: >> @taavi @Andrew @cmooney what do you think of the above?  >...
[09:15:49] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11357479 (jcrespo) Yes, please, dc ops, file a servicing request or help us with a spare here.
[09:15:50] <wikibugs>	 (Merged) jenkins-bot: airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:16:09] <wikibugs>	 (Merged) jenkins-bot: airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:17:42] <wikibugs>	 (CR) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:17:44] <wikibugs>	 (PS6) Brouberol: dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591)
[09:18:14] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: switch 1048 to single interface [puppet] - https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180)
[09:18:16] <wikibugs>	 (PS1) Filippo Giunchedi: cloudcephosd: switch 1049 to single interface [puppet] - https://gerrit.wikimedia.org/r/1203384 (https://phabricator.wikimedia.org/T399180)
[09:22:24] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:23:00] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:25:51] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "My tests on idp1005 were all fine" [dns] - https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[09:26:17] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11357512 (jcrespo) For Robh, a bit of a background on the requirements for production dbs, from a backup perspective, so he has the global understanding of our aim. Databases rar...
[09:27:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove historic comments [deployment-charts] - https://gerrit.wikimedia.org/r/1202127 (owner: Muehlenhoff)
[09:29:20] <wikibugs>	 (CR) Btullis: [C:+1] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:30:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] osm_master: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:32:59] <wikibugs>	 (PS1) Brouberol: airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711)
[09:33:28] <wikibugs>	 (CR) Btullis: [C:+1] growthbook: define configuration for local file uploads [deployment-charts] - https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: Brouberol)
[09:33:51] <wikibugs>	 (PS2) Muehlenhoff: osm_replica: Remove support for pre Bookworm [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565)
[09:34:01] <wikibugs>	 (CR) Btullis: [C:+1] airflow: migrate to the image defined in the airflow-dags repo [deployment-charts] - https://gerrit.wikimedia.org/r/1203379 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:34:24] <wikibugs>	 (CR) Btullis: [C:+1] airflow: deploy the image tested on test-k8s to all instances [deployment-charts] - https://gerrit.wikimedia.org/r/1203380 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:34:51] <wikibugs>	 (CR) Btullis: [C:+1] dse-k8s-eqiad: enable ceph-csi-cephfs in the growthbook namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1202995 (owner: Brouberol)
[09:35:43] <wikibugs>	 (CR) Brouberol: [C:+2] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service [puppet] - https://gerrit.wikimedia.org/r/1203381 (https://phabricator.wikimedia.org/T409591) (owner: Brouberol)
[09:36:50] <wikibugs>	 (CR) Btullis: [C:+1] airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:37:35] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:39:02] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-analytics-test: use the common airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203387 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[09:39:06] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:07] <wikibugs>	 (CR) Brouberol: [C:+2] growthbook: define configuration for local file uploads [deployment-charts] - https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: Brouberol)
[09:48:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:49:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:50:14] <wikibugs>	 (PS1) Muehlenhoff: Fix cumin alias for maps [puppet] - https://gerrit.wikimedia.org/r/1203390 (https://phabricator.wikimedia.org/T381565)
[09:50:20] <wikibugs>	 (CR) Clément Goubert: [C:-1] "The CI fails because of this error:" [deployment-charts] - https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: Daniel Kinzler)
[09:52:55] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:57:24] <wikibugs>	 (PS1) Dpogorzelski: aya-llm: fix tolerations and affinity [deployment-charts] - https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697)
[09:58:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:59:26] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:59:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:59:45] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:59:56] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:01:57] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:02:37] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDP: Switch to CAS 7.2 [dns] - 10https://gerrit.wikimedia.org/r/1203309 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede)
[10:03:17] <slyngs>	 !log Upgrade CAS (idp.wikimedia.org) to version 7.2.7
[10:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:43] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[10:04:02] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:04:38] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[10:04:44] <wikibugs>	 (03PS1) 10Majavah: hieradata: Enable jumbo frames on all codwf1dev cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075)
[10:04:46] <wikibugs>	 (03PS1) 10Majavah: hieradata: Enable jumbo frames on codfw1dev cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1203399 (https://phabricator.wikimedia.org/T330075)
[10:04:48] <wikibugs>	 (03PS1) 10Majavah: hieradata: Enable jumbo frames on remaining codfw1dev nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203400 (https://phabricator.wikimedia.org/T330075)
[10:06:57] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:08:00] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:08:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[10:08:47] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Enable jumbo frames on all codwf1dev cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1203398 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[10:09:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:10:18] <wikibugs>	 (03PS4) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264)
[10:10:41] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aya-llm: fix tolerations and affinity (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[10:11:59] <wikibugs>	 (03PS2) 10Dpogorzelski: ml-services: fix tolerations and affinity for aya-llm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697)
[10:12:21] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: fix tolerations and affinity for aya-llm (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[10:14:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:14:14] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix tolerations and affinity for aya-llm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203396 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[10:15:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707 (10Chandra-WMDE) 03NEW
[10:16:19] <wikibugs>	 (03CR) 10Clément Goubert: [C:04-1] Note that per-route rate limits require Envoy 1.33 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler)
[10:16:57] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:18:22] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:23:22] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:37] <wikibugs>	 (03CR) 10Vgutierrez: "code looks good, please fix the indentation issues mentioned in the inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz)
[10:24:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11357869 (10AndrewTavis_WMDE)
[10:26:21] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:28:22] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:33:35] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401
[10:34:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:38:05] <wikibugs>	 (03PS1) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544)
[10:44:25] <wikibugs>	 (03PS2) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544)
[10:49:48] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7591/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[10:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:51:13] <wikibugs>	 (03CR) 10Clément Goubert: "Yeah that's half a worker per pod. Checking thanos https://w.wiki/G23r afaict right now there would only be 3 workers, all in `eqiad`, tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry)
[10:52:38] <wikibugs>	 (03PS1) 10Tiziano Fogli: metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003)
[10:52:45] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[10:52:47] <wikibugs>	 (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] metamonitoring/icinga/ext-mon: add dummy basic auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1203405 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[10:58:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11357931 (10Nahid) 05Resolved→03Open Hey all - Thanks for attending this task. I am re-opening the task but please let me know if it needs a new ticket. Sarah is having...
[10:59:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for khernandez [puppet] - 10https://gerrit.wikimedia.org/r/1203407
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1100)
[11:00:17] <wikibugs>	 (03CR) 10Clément Goubert: "Yep that's exactly that." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga)
[11:00:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for khernandez [puppet] - 10https://gerrit.wikimedia.org/r/1203407 (owner: 10Muehlenhoff)
[11:04:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[11:06:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: define catch-all rate limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) (owner: 10Daniel Kinzler)
[11:08:25] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga)
[11:08:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[11:09:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[11:10:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[11:10:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11357966 (10MoritzMuehlenhoff) @RobH ganeti1024 and ganeti1033 are drained and can be migrated.
[11:12:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11357969 (10MoritzMuehlenhoff)
[11:14:14] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712 (10jcrespo) 03NEW
[11:18:04] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11357982 (10jcrespo)
[11:20:01] <wikibugs>	 (03PS1) 10Silvan Heintze: Report integrity metric from wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482)
[11:20:01] <wikibugs>	 (03CR) 10Silvan Heintze: "As discussed: for this to work, an added network policy is needed to allow access from the Kubernetes pods to the push gateway." [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[11:21:14] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11357987 (10jcrespo)
[11:21:57] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:23:22] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:47] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: (Re-)Add monitoring for the internal Ganeti certs - https://phabricator.wikimedia.org/T382902#11358007 (10MoritzMuehlenhoff) 05Open→03Resolved I've added a new Prometheus exporter to all Ganeti nodes (which only runs on the masters), which detects the re...
[11:24:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:25:09] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223)
[11:25:32] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358012 (10jcrespo) It shows up also here: {F70074799} Maybe it is expected?
[11:26:21] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:26:37] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2011 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:26:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:26:57] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:27:06] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/14 (Core: lsw1-d6-eqiad:ethernet-1/56 {#B00369}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[11:27:35] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2011 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:28:15] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358013 (10jcrespo) The found workaround: https://alerts.wikimedia.org/?q=%40silenced_by%3D6c0e20b0-632b-4410-be33-32f631f020a5
[11:28:22] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:29:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:29:03] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358016 (10Volans) Possibly related to T328869
[11:31:57] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:34:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:37:20] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert)
[11:38:17] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401
[11:38:58] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358084 (10Jclark-ctr) @jcrespo can this be swapped at anytime or do we need to schedule?
[11:39:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:40:38] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358086 (10jcrespo) Go ahead if it doesn't require shutdown. If it requires or it is preferred, just let me know and I will perform it myself right now, will...
[11:41:57] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:44:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:46:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:48:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2010:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:49:02] <jinxer-wm>	 RESOLVED: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:51:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[11:53:14] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415
[11:53:43] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2010:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:54:31] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415
[11:56:43] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358164 (10jcrespo) A yes, I couldn't find it before. I will merge it there.
[11:56:54] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez)
[11:57:10] <wikibugs>	 06SRE, 10Observability-Alerting: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869#11358168 (10jcrespo)
[11:57:12] <wikibugs>	 06SRE, 10observability, 10Observability-Alerting: Wrong url on checking silence on alertmanager - https://phabricator.wikimedia.org/T409712#11358171 (10jcrespo) →14Duplicate dup:03T328869
[12:00:34] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy on both text & upload" [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez)
[12:00:58] <wikibugs>	 (03PS1) 10STran: Deploy temporary accounts to more large projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691)
[12:02:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.185s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:02:45] <wikibugs>	 (03PS2) 10STran: Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691)
[12:12:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:12:22] <wikibugs>	 (03PS1) 10Brouberol: global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482)
[12:14:50] <wikibugs>	 (03PS1) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482)
[12:22:29] <wikibugs>	 (03CR) 10Btullis: global_config: define a prometheus external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:22:42] <wikibugs>	 (03PS1) 10Majavah: P:toolfroge::elasticsearch::haproxy: Use firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1203425
[12:22:42] <wikibugs>	 (03PS1) 10Majavah: P:toolfroge::elasticsearch::haproxy: Enable native Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1203426 (https://phabricator.wikimedia.org/T343885)
[12:22:44] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885)
[12:22:46] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428
[12:23:32] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::elasticsearch::haproxy: Use firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1203425
[12:23:32] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::elasticsearch::haproxy: Enable native Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1203426 (https://phabricator.wikimedia.org/T343885)
[12:23:32] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885)
[12:23:32] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428
[12:24:13] <wikibugs>	 (03PS2) 10Brouberol: global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482)
[12:24:21] <wikibugs>	 (03CR) 10Btullis: airflow-test-k8s: allow task-pod -> prometheus gateway egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:24:26] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7598/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203425 (owner: 10Majavah)
[12:24:26] <jinxer-wm>	 FIRING: InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 4652 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[12:24:28] <wikibugs>	 (03PS2) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482)
[12:24:45] <wikibugs>	 (03CR) 10Brouberol: global_config: define a prometheus external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:25:01] <wikibugs>	 (03CR) 10Brouberol: airflow-test-k8s: allow task-pod -> prometheus gateway egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:25:03] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] db2166: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202775 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto)
[12:25:15] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Upgrade db2199, last backup source with 10.6 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1203429 (https://phabricator.wikimedia.org/T394487)
[12:27:18] <wikibugs>	 (03PS2) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[12:27:43] <jynus>	 I think the bot didn't work, but there is an ongoing #page related to MX queue
[12:27:50] <wikibugs>	 (03CR) 10Btullis: [C:03+1] global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:28:04] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:28:11] <jynus>	 ah, it worked, there are just a lot of messages here
[12:28:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade
[12:28:41] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2166 - Upgrading db2166.codfw.wmnet
[12:29:00] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2166 - Upgrading db2166.codfw.wmnet
[12:29:07] <jynus>	 things started going bad at 9:30
[12:29:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) (owner: 10Btullis)
[12:31:05] <elukey>	 here sorry 
[12:31:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:32:23] <elukey>	 acked from the phone
[12:32:24] <arnoldokoth>	 Here as well. Ack'd the page and checking docs if there is anything that can be done.
[12:33:28] <jynus>	 issue is on wikipedia with P
[12:33:39] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] global_config: define a prometheus external service [puppet] - 10https://gerrit.wikimedia.org/r/1203418 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:33:48] <logmsgbot>	 fceratto@cumin1003 major-upgrade (PID 1435157) is awaiting input
[12:34:30] <elukey>	 I have zero knowledge of the MX queue so I'll need to take a bit more time to check the metrics, but jynus is right, something seems to have started at around 9:30
[12:34:35] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mediawiki: Update location of startupregistrystats script [puppet] - 10https://gerrit.wikimedia.org/r/1202872 (https://phabricator.wikimedia.org/T409212) (owner: 10Zabe)
[12:35:11] <elukey>	 judging from the queue size though it seems that it got worse two hours afterwards
[12:36:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:37:32] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro)
[12:37:50] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203434
[12:38:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11358311 (10Raine) >>! In T407094#11357931, @Nahid wrote: > It looks like the dot < **. **>  at the end of the public key is missing in the patch. The dot is actually part o...
[12:38:34] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358313 (10jcrespo) We are working on it, alarm notified us.
[12:42:20] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: allow task-pod -> prometheus gateway egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203420 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[12:43:18] <Amir1>	 !log drop database if exists de_labswikimedia; drop database if exists en_labswikimedia; drop database if exists flaggedrevs_labswikimedia; drop database if exists liquidthreads_labswikimedia; drop database if exists readerfeedback_labswikimedia; (T297297)
[12:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:21] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[12:44:10] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:44:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:46:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:46:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:49:20] <wikibugs>	 (03PS1) 10Brouberol: Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444
[12:49:40] <Amir1>	 !log drop database if exists tokiponawiki; drop database if exists tokiponawikibooks; drop database if exists tokiponawikiquote; drop database if exists tokiponawiktionary; (T297297)
[12:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:44] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[12:50:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:53:32] <wikibugs>	 (03PS1) 10Brouberol: mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482)
[12:55:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:56:46] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2166 gradually with 4 steps - Migration of db2166.codfw.wmnet completed
[12:59:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott)
[12:59:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11358376 (10Jclark-ctr)
[12:59:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T409646#11358378 (10Jclark-ctr) →14Duplicate dup:03T408065
[13:00:14] <wikibugs>	 (03PS3) 10Andrew Bogott: clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162)
[13:00:15] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott)
[13:00:34] <wikibugs>	 (03PS2) 10Brouberol: mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482)
[13:02:10] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358382 (10elukey) I am really really ignorant about postfix so please bear with me :)  I ran:  ` elukey@mx-in1001:~$ f...
[13:04:03] <wikibugs>	 (03CR) 10Majavah: [C:04-1] toolforge haproxy config: replace httpchk with http-check send (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203175 (owner: 10Andrew Bogott)
[13:05:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[13:05:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] clouddb1026-1033: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1203271 (https://phabricator.wikimedia.org/T409162) (owner: 10Andrew Bogott)
[13:05:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[13:07:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[13:07:48] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[13:08:12] <wikibugs>	 (03PS3) 10Andrew Bogott: toolforge haproxy config: replace httpchk with http-check send [puppet] - 10https://gerrit.wikimedia.org/r/1203175
[13:08:26] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:09:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:10:46] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 3 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358432 (10Arnoldokoth)
[13:11:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:11:44] <elukey>	 !log restart postfix on mx-in2001 to apply an IP ban - T408632
[13:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:48] <stashbot>	 T408632: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632
[13:12:02] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358441 (10Arnoldokoth)
[13:12:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:12:24] <elukey>	 !log restart postfix on mx-in1001 to apply an IP ban - T408632
[13:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:57] <Amir1>	 !log  mwscript-k8s --dblist=medium --follow -- purgeUserOptions.php --login-age 15 (T406724)
[13:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:00] <stashbot>	 T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724
[13:15:12] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[13:15:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[13:16:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] "seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[13:16:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[13:16:36] <Amir1>	 jouncebot: nowandnext
[13:16:36] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 43 minute(s)
[13:16:36] <jouncebot>	 In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1400)
[13:16:48] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on codfw1dev cloudnets [puppet] - 10https://gerrit.wikimedia.org/r/1203399 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[13:17:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[13:17:31] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358469 (10Jclark-ctr) a:03Jclark-ctr
[13:17:42] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup)
[13:18:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup)
[13:18:26] <wikibugs>	 (03Merged) 10jenkins-bot: Remove nlwiki exception from thumb limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup)
[13:19:05] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]]
[13:19:08] <stashbot>	 T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715
[13:19:26] <jinxer-wm>	 FIRING: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 22380 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[13:23:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:27:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[13:27:28] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on remaining codfw1dev nodes [puppet] - 10https://gerrit.wikimedia.org/r/1203400 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah)
[13:27:38] <wikibugs>	 (03PS3) 10KartikMistry: machinetranslation: Increase replica and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371)
[13:27:38] <wikibugs>	 (03PS1) 10KartikMistry: Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000)
[13:28:14] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[13:28:20] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358526 (10elukey) Judging from the [[ https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers?orgId=1&from=now-...
[13:28:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:29:39] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:30:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[13:30:44] <wikibugs>	 (03PS2) 10KartikMistry: Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000)
[13:33:51] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[13:34:43] <wikibugs>	 (03PS3) 10Majavah: P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544)
[13:35:07] <wikibugs>	 (03CR) 10Sbisson: [C:03+1] Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry)
[13:35:51] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry)
[13:37:05] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:openstack: neutron: Enable jumbo frames in codfw1dev Neutron config [puppet] - 10https://gerrit.wikimedia.org/r/1203402 (https://phabricator.wikimedia.org/T409544) (owner: 10Majavah)
[13:37:31] <wikibugs>	 (03Merged) 10jenkins-bot: Update Recommnedation API to 2025-11-07-162011-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203450 (https://phabricator.wikimedia.org/T405000) (owner: 10KartikMistry)
[13:41:21] <wikibugs>	 (03PS1) 10Brouberol: airflow: assume the PYTHONPATH env var is defined in the airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203451 (https://phabricator.wikimedia.org/T408711)
[13:41:38] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:42:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - Migration of db2166.codfw.wmnet completed
[13:42:16] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[13:43:36] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:43:39] <stashbot>	 T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715
[13:44:48] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[13:47:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731 (10cmooney) 03NEW p:05Triage→03High
[13:50:28] <wikibugs>	 (03PS14) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387)
[13:50:41] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:51:04] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-serve: tweak aya llm mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697)
[13:51:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade
[13:51:27] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2 C:03+2] varnish: Remove abuse_networks netmapper lookup [puppet] - 10https://gerrit.wikimedia.org/r/1203415 (owner: 10Vgutierrez)
[13:51:29] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2167 - Upgrading db2167.codfw.wmnet
[13:51:41] <kart_>	 I've merged patch https://gerrit.wikimedia.org/r/1203450 but it isn't available on the deployment server to deploy. What can be the reason? The diff is empty.
[13:51:58] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2167 - Upgrading db2167.codfw.wmnet
[13:52:42] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] db2167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202776 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto)
[13:52:55] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:53:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11358656 (10cmooney) I can take a look at this unless there is another plan?
[13:55:11] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: 10Krinkle)
[13:57:06] <logmsgbot>	 fceratto@cumin1003 major-upgrade (PID 1518129) is awaiting input
[13:57:40] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401
[13:57:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza)
[13:59:59] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201584|Remove nlwiki exception from thumb limits (T408715)]] (duration: 40m 54s)
[14:00:03] <stashbot>	 T408715: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1400).
[14:00:06] <jouncebot>	 edsanders and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] <edsanders>	 o/
[14:00:31] <edsanders>	 I can self deploy
[14:02:26] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy for both text & upload clusters" [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez)
[14:02:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders)
[14:03:42] <wikibugs>	 (03Merged) 10jenkins-bot: Freeze LiquidThreads on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202985 (https://phabricator.wikimedia.org/T405080) (owner: 10Esanders)
[14:04:01] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]]
[14:04:05] <stashbot>	 T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080
[14:06:41] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[14:08:08] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:08:31] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:09:59] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444 (owner: 10Brouberol)
[14:10:03] <wikibugs>	 (03CR) 10Elukey: "The change is ok but is aya meant to run with only 8G of memory? I'd defer to Aiko for the final +1, since I suspect that we may need more" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[14:10:44] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:10:47] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[14:13:22] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:26] <jinxer-wm>	 RESOLVED: [2x] InboundMXQueueHigh: MX host mx-in1001:9154 has many queued messages: 1058 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[14:15:19] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11358737 (10Jclark-ctr) @jcrespo This server is out of warranty. I replaced the disk with one from a decommissioned server; the drive was erased prior to inst...
[14:15:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:17:49] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202985|Freeze LiquidThreads on enwiktionary (T405080)]] (duration: 13m 48s)
[14:17:53] <stashbot>	 T405080: Convert LQT pages on enwiktionary to Flow - https://phabricator.wikimedia.org/T405080
[14:20:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:23:41] <wikibugs>	 (03CR) 10BBlack: [C:03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez)
[14:26:10] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2167 gradually with 4 steps - Migration of db2167.codfw.wmnet completed
[14:31:12] <kart_>	 !log Update Recommnedation API to 2025-11-07-162011-production (T405000, T406854, T408936, T408937, T408934)
[14:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:25] <stashbot>	 T405000: Handle failure to load languages from cx server - https://phabricator.wikimedia.org/T405000
[14:31:26] <stashbot>	 T406854: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854
[14:31:26] <stashbot>	 T408936: Error when calling the wikidata API with no titles - https://phabricator.wikimedia.org/T408936
[14:31:27] <stashbot>	 T408937: Faulty error handling when fetching language pairs - https://phabricator.wikimedia.org/T408937
[14:31:28] <stashbot>	 T408934: Production error: AttributeError: 'NoneType' object has no attribute 'keys' - https://phabricator.wikimedia.org/T408934
[14:31:36] <wikibugs>	 (03PS1) 10Esanders: Create maintenance script to apply manual fixes [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426)
[14:31:53] <wikibugs>	 (03Abandoned) 10CDanis: discovery.wmnet: add gerrit alias [dns] - 10https://gerrit.wikimedia.org/r/1198352 (https://phabricator.wikimedia.org/T365259) (owner: 10CDanis)
[14:32:00] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] Deploy temporary accounts to more large/LQT-unblocked projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: 10STran)
[14:32:04] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11358831 (10elukey) Looks like we are back in acceptable ranges again! Please let me know if anything is missing.
[14:32:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders)
[14:32:38] <Superpes52>	 Hi, sorry, I'm having some problems with IRC... Are you deploying rn?
[14:33:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders)
[14:33:30] <edsanders>	 Superpes52: hey I just started my second patch
[14:34:11] <edsanders>	 Superpes52: shall I do your config change after, or can you deploy by yourself?
[14:34:42] <Superpes52>	 edsanders Ah so I'm on time! I'm not a deployer :)
[14:34:57] <edsanders>	 ok, no problem
[14:37:23] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "I also suggest using the default permission (0755) for consistency and treating the whole data directory and Prometheus as Grafana instanc" [puppet] - 10https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[14:41:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11358897 (10fgiunchedi) Yes please @cmooney, much appreciated! Note that this is currently not a blocker / not high...
[14:42:44] <wikibugs>	 (03Merged) 10jenkins-bot: Create maintenance script to apply manual fixes [extensions/Flow] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1203457 (https://phabricator.wikimedia.org/T397426) (owner: 10Esanders)
[14:43:06] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]]
[14:43:10] <stashbot>	 T397426: Implement bulk fixes on ptwikibooks - https://phabricator.wikimedia.org/T397426
[14:45:05] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:45:30] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:45:51] <icinga-wm>	 PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:33] <wikibugs>	 (03PS1) 10Bking: opensearch-cluster: raise defaults to match design doc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203458 (https://phabricator.wikimedia.org/T409501)
[14:48:19] <icinga-wm>	 RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 31.39 ms
[14:49:09] <wikibugs>	 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735 (10cmooney) 03NEW p:05Triage→03Medium
[14:50:27] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203457|Create maintenance script to apply manual fixes (T397426)]] (duration: 07m 21s)
[14:50:32] <stashbot>	 T397426: Implement bulk fixes on ptwikibooks - https://phabricator.wikimedia.org/T397426
[14:50:39] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[14:50:48] <edsanders>	 Superpes52: shall I start your config change?
[14:50:59] <Superpes52>	 Yep thanks edsanders
[14:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:51:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203238 (https://phabricator.wikimedia.org/T409578) (owner: 10Superpes15)
[14:52:05] <wikibugs>	 (03Merged) 10jenkins-bot: [ptwiki] Add new abusefilter usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203238 (https://phabricator.wikimedia.org/T409578) (owner: 10Superpes15)
[14:52:22] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]]
[14:52:26] <stashbot>	 T409578: Create new user group on ptwiki "Administrador do filtro de abusos" - https://phabricator.wikimedia.org/T409578
[14:53:35] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[14:54:53] <logmsgbot>	 !log esanders@deploy2002 superpes, esanders: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:55:04] <Superpes52>	 Testing
[14:55:37] <Superpes52>	 Looks fine thanks edsanders :)
[14:55:43] <logmsgbot>	 !log esanders@deploy2002 superpes, esanders: Continuing with sync
[15:00:00] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203238|[ptwiki] Add new abusefilter usergroup (T409578)]] (duration: 07m 37s)
[15:00:04] <stashbot>	 T409578: Create new user group on ptwiki "Administrador do filtro de abusos" - https://phabricator.wikimedia.org/T409578
[15:00:21] <Superpes52>	 Many thanks for your assistance edsanders :3
[15:01:53] <edsanders>	 no problem
[15:05:14] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11359057 (10jcrespo) I will keep an eye on it until it gets rebuilt, thanks for the quick help. I will also have a look at the warnings.
[15:08:22] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:32] <wikibugs>	 (03PS1) 10CDanis: base: add bat (batcat) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1203462
[15:11:39] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2167 gradually with 4 steps - Migration of db2167.codfw.wmnet completed
[15:11:40] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[15:15:31] <wikibugs>	 (03CR) 10Dpogorzelski: "i just need to be able to iterate and make tiny but consistent progress. these are mostly dev changes so i think i'll just go ahead and me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[15:18:52] <logmsgbot>	 andrew@cumin2002 reimage (PID 1974214) is awaiting input
[15:25:16] <Amir1>	 !log drop database if exists webshop (T297297)
[15:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:20] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[15:26:26] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: enable egress to the prometheus-pushgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203445 (https://phabricator.wikimedia.org/T403482) (owner: 10Brouberol)
[15:26:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Revert "airflow-test-k8s: allow task-pod -> prometheus gateway egress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203444 (owner: 10Brouberol)
[15:26:58] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-serve: tweak aya llm mem limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[15:27:24] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11359162 (10jcrespo) I saw the warnings, but I see no problem on the logs, other than it detecting your disk change and firmware update. Once the disk rebuild...
[15:29:05] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:29:14] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1530)
[15:32:11] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2 C:03+2] varnish: Drop trusted proxies support [puppet] - 10https://gerrit.wikimedia.org/r/1203401 (owner: 10Vgutierrez)
[15:33:22] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:18] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[15:37:40] <swfrench-wmf>	 jouncebot: nowandnext
[15:37:40] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1530)
[15:37:40] <jouncebot>	 In 0 hour(s) and 52 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1630)
[15:40:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11359233 (10LSobanski) p:05Triage→03Medium
[15:41:00] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414)
[15:41:33] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11359238 (10LSobanski) 05Open→03Resolved Resolving, please reopen if you still think this is a problem.
[15:42:56] <wikibugs>	 (03CR) 10Elukey: "Post-merge comment: I totally understand your point but we have had problems in the past with models taking some extra memory when loading " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203453 (https://phabricator.wikimedia.org/T403697) (owner: 10Dpogorzelski)
[15:42:58] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski)
[15:46:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823#11359271 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving as part of backlog review. There have been changes to the network and Puppet since the creation of this ta...
[15:46:44] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741)
[15:46:45] <wikibugs>	 (03PS2) 10Dpogorzelski: ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414)
[15:47:09] <wikibugs>	 (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-services: reduce aya's cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203473 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski)
[15:47:52] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[15:50:16] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: pilot cfssl/pki for etcd on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1182658 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[15:51:13] <swfrench-wmf>	 I'm going to merge a change shortly that will require a short disruption on a single conftool etcd node in codfw
[15:51:47] <swfrench-wmf>	 although it's unlikely this will cause problems, it's preferable if no mediawiki deployment is ongoing concurrently
[15:52:26] <swfrench-wmf>	 as such, I am going to briefly take the scap lock while the change is happening
[15:52:38] <Lucas_WMDE>	 thanks for deploying edsanders <3
[15:52:42] <Lucas_WMDE>	 (I was busy earlier)
[15:52:50] <wikibugs>	 (03CR) 10David Caro: [C:03+1] P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741) (owner: 10Majavah)
[15:53:28] <logmsgbot>	 !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245
[15:53:32] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[15:56:48] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::spicerack_config: Do not log changes of secrets [puppet] - 10https://gerrit.wikimedia.org/r/1203474 (https://phabricator.wikimedia.org/T409741) (owner: 10Majavah)
[16:05:06] <logmsgbot>	 !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 (duration: 11m 38s)
[16:05:27] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[16:05:31] <swfrench-wmf>	 the dust has settled and I've released the lock. thanks all
[16:08:35] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] db2181: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202777 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto)
[16:10:11] <swfrench-wmf>	 !log begin rolling restart of codfw-associated confds after conf2006 etcd restart - T352245
[16:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:54] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade
[16:11:17] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2181 - Upgrading db2181.codfw.wmnet
[16:11:35] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2181 - Upgrading db2181.codfw.wmnet
[16:13:06] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136)
[16:14:04] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2199.codfw.wmnet with reason: MariaDB upgrade
[16:14:35] <logmsgbot>	 fceratto@cumin1003 major-upgrade (PID 1656166) is awaiting input
[16:16:28] <wikibugs>	 (03PS1) 10Brouberol: growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415)
[16:17:13] <wikibugs>	 (03CR) 10Muehlenhoff: "It's more, it's also the official tool to self-manage your developer account (change email, change Cloud SSH keys e.g.). In addition it ha" [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff)
[16:18:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) (owner: 10Muehlenhoff)
[16:18:47] <wikibugs>	 (03PS1) 10Brouberol: Define the synthetic data PG data source in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203487 (https://phabricator.wikimedia.org/T409591)
[16:21:12] <wikibugs>	 (03PS2) 10Brouberol: growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415)
[16:21:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Define the synthetic data PG data source in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203487 (https://phabricator.wikimedia.org/T409591) (owner: 10Brouberol)
[16:23:36] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136)
[16:23:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] growthbook: add omitted pvc.yaml file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203486 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol)
[16:23:45] <wikibugs>	 (03PS1) 10CDanis: admin: deployment: add volker-e & new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243)
[16:25:21] <wikibugs>	 (03CR) 10CDanis: "Please confirm whether or not you still want the old ssh-rsa key kept active as well, and then we'll get the access updated too.  Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) (owner: 10CDanis)
[16:25:25] <logmsgbot>	 fceratto@cumin1003 major-upgrade (PID 1656166) is awaiting input
[16:27:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11359455 (10Jclark-ctr) I did finally get confirmation on tracking on replacement memory. It should be onsite by end of day tomorrow unless delayed by holiday.  Can i repla...
[16:27:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11359467 (10Jclark-ctr) a:05Marostegui→03Jclark-ctr
[16:28:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:28:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:29:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136) (owner: 10Muehlenhoff)
[16:30:05] <jouncebot>	 jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1630).
[16:30:59] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin1002 - T407110
[16:31:35] <logmsgbot>	 !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply updates - bking@cumin1002 - T407110
[16:31:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad reboot failed, stuck in UEFI shell - https://phabricator.wikimedia.org/T409731#11359494 (10Jclark-ctr)
[16:31:38] <wikibugs>	 (03PS4) 10Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482
[16:31:47] <wikibugs>	 (03CR) 10CDanis: [C:03+2] admin: Update brett SSH key to FIDO [puppet] - 10https://gerrit.wikimedia.org/r/1203179 (https://phabricator.wikimedia.org/T409600) (owner: 10BCornwall)
[16:31:55] <wikibugs>	 (03PS1) 10Mmartorana: Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996)
[16:32:06] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110
[16:34:02] <logmsgbot>	 !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110
[16:34:05] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110
[16:41:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1026:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:08] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11359587 (10LSobanski) p:05Triage→03Low
[16:47:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[16:48:34] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: reboot to apply updates - bking@cumin1002 - T407110
[16:51:40] <wikibugs>	 (03PS1) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860)
[16:52:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:52:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[16:56:17] <icinga-wm>	 RECOVERY - MegaRAID on db1171 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:56:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking)
[16:57:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:57:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff)
[16:58:19] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:58:50] <wikibugs>	 (03CR) 10SBassett: [C:03+2] Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996) (owner: 10Mmartorana)
[16:59:50] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498)
[17:01:15] <wikibugs>	 (03Merged) 10jenkins-bot: Security-landing-page: bump image to 2025-10-27-155537 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203489 (https://phabricator.wikimedia.org/T404996) (owner: 10Mmartorana)
[17:02:02] <jinxer-wm>	 FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:02:24] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:02:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:02:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11359722 (10CDanis) Hi @Chandra-WMDE , seems like you posted the private key in the task instead of the public.  Please stop using that key for anything, and generate a new one,...
[17:06:57] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:07:07] <jinxer-wm>	 FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:10:55] <wikibugs>	 (03PS1) 10CDanis: admin: btullis: remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279)
[17:11:57] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:08] <wikibugs>	 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11359791 (10Jdrewniak) hi @Dzahn, I just confirmed with @cmadeo that the desired domain/path for this microsite is actually:   https://www.wikipedia.org/25-years-o...
[17:13:11] <wikibugs>	 (03CR) 10JMeybohm: "Looks good, thanks! I would suggest to add the include to a couple of more helmfile files in order to make sure the CI change does not sta" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[17:13:22] <wikibugs>	 (03CR) 10JMeybohm: "I don't think removing from general-* files will work as of now since admin_ng helmfiles ingest the value from there." [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[17:16:09] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2181 gradually with 4 steps - Migration of db2181.codfw.wmnet completed
[17:16:57] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:21:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH key for Brett Cornwall - https://phabricator.wikimedia.org/T409600#11359832 (10CDanis) 05Open→03Resolved merged and fast-deployed to `A:bastion OR A:cumin`
[17:21:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis)
[17:21:57] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:22:16] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:22:42] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade db2199, last backup source with 10.6 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1203429 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo)
[17:26:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove leftover import of python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860)
[17:26:57] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:30:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1203495 (https://phabricator.wikimedia.org/T409279) (owner: 10CDanis)
[17:31:57] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove leftover import of python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff)
[17:39:45] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Remove leftover import of python-elastic [cookbooks] - 10https://gerrit.wikimedia.org/r/1203496 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff)
[17:48:38] <wikibugs>	 (03CR) 10BPirkle: [C:03+1] Change RESTbase => REST in wgRestSandboxSpecs names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (owner: 10Aaron Schulz)
[17:49:26] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:53:23] <wikibugs>	 (03PS1) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727)
[17:54:24] <hauskater>	 phabricator.wikimedia.org seems down? Getting: Request served via cp3070 cp3070, Varnish XID 37697385
[17:54:25] <hauskater>	 Upstream caches: cp3070 int
[17:54:25] <hauskater>	 Error: 403, 02cd48e281926cca9 (0930e9c) at Mon, 10 Nov 2025 17:53:59 GMT
[17:54:26] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:59:36] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:00:04] <jouncebot>	 swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1800). nyaa~
[18:00:05] <jouncebot>	 ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T1800).
[18:01:16] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[18:01:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2181 gradually with 4 steps - Migration of db2181.codfw.wmnet completed
[18:01:39] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0)
[18:01:49] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:02:16] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:02:49] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[18:03:07] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[18:03:10] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 50% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203284 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[18:03:34] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[18:03:54] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[18:04:05] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[18:04:14] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[18:04:26] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[18:04:31] <logmsgbot>	 !log mmartorana@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[18:04:43] <wikibugs>	 (03PS1) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891)
[18:05:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey)
[18:05:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:05:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:05:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:06:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:07:38] <wikibugs>	 (03PS2) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891)
[18:09:38] <wikibugs>	 (03PS3) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891)
[18:09:59] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on people1004.eqiad.wmnet with reason: decom
[18:10:21] <wikibugs>	 (03PS4) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891)
[18:10:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:10:34] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts people1004.eqiad.wmnet
[18:10:38] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[18:10:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:11:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:11:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:11:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:11:57] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:11:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:12:07] <jinxer-wm>	 FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:12:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:12:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:12:44] <ryankemper>	 !log [WDQS] Restarted wdqs-main in codfw
[18:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:00] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2090.codfw.wmnet with OS bullseye
[18:14:06] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:14:07] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360109 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2090.codfw.wmnet with OS bullseye
[18:14:07] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s4 on db2199 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:14:09] <icinga-wm>	 PROBLEM - MariaDB read only s4 on db2199 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[18:14:13] <icinga-wm>	 PROBLEM - mysqld processes on db2199 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[18:14:31] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2091.codfw.wmnet with OS bullseye
[18:14:35] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db2199 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:14:45] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360110 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2091.codfw.wmnet with OS bullseye
[18:15:00] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2093.codfw.wmnet with OS bullseye
[18:15:06] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360111 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2093.codfw.wmnet with OS bullseye
[18:15:27] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2094.codfw.wmnet with OS bullseye
[18:15:36] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360113 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye
[18:15:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:15:50] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[18:15:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:16:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:16:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:17:35] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage
[18:17:54] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage
[18:18:25] <wikibugs>	 (CR) Scott French: [C:+2] deployment_server: migrate mw-(cron|videoscaler) to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203285 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
[18:18:36] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[18:18:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[18:20:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:20:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:20:36] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:20:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:21:25] <logmsgbot>	 dzahn@cumin2002 decommission (PID 2061752) is awaiting input
[18:22:02] <jinxer-wm>	 FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:22:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:22:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:23:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:23:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:23:31] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[18:24:01] <icinga-wm>	 PROBLEM - Host db2199 is DOWN: PING CRITICAL - Packet loss = 100%
[18:24:55] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage
[18:25:00] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[18:25:01] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:25:01] <icinga-wm>	 RECOVERY - Host db2199 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
[18:25:02] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people1004.eqiad.wmnet
[18:25:09] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s4 on db2199 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:25:11] <icinga-wm>	 PROBLEM - MariaDB read only s4 on db2199 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[18:25:16] <jynus>	 ^downtime expired
[18:25:17] <icinga-wm>	 PROBLEM - mysqld processes on db2199 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[18:25:18] <jynus>	 fixing
[18:25:35] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 on db2199 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:26:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11360194 (Dzahn) a:Dzahn→SKaram-WMF
[18:26:17] <icinga-wm>	 RECOVERY - mysqld processes on db2199 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[18:26:35] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on db2199 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:27:02] <jinxer-wm>	 FIRING: [15x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:27:07] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s4 on db2199 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:27:11] <icinga-wm>	 RECOVERY - MariaDB read only s4 on db2199 is OK: Version 10.11.14-MariaDB-log, Uptime 64s, read_only: True, event_scheduler: True, 3987.95 QPS, connection latency: 0.028184s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[18:27:52] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[18:28:52] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage
[18:29:47] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-(cron|videoscaler) to PHP 8.3 - T405955
[18:29:51] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:30:04] <logmsgbot>	 !log swfrench@deploy2002 Stopping before sync operations
[18:32:02] <jinxer-wm>	 FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:32:37] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[18:33:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[18:33:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11360252 (Dzahn) The first space character separates the key from the comment field. It should work with or without the comment field though.  To debug I recommend first v...
[18:34:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[18:34:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[18:35:22] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts people2003.codfw.wmnet
[18:39:21] <wikibugs>	 (PS1) Dzahn: site: remove decom'ed people bookworm machines [puppet] - https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713)
[18:40:19] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[18:42:02] <jinxer-wm>	 FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:42:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[18:43:08] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[18:44:01] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[18:44:28] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: people2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[18:44:29] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:44:30] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts people2003.codfw.wmnet
[18:45:12] <mutante>	 !log destroyed former people.wikimedia.org backends people1004/people2003 - replaced by trixie VMs people1005/people2004
[18:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:52] <wikibugs>	 (CR) Dzahn: [C:+2] site: remove decom'ed people bookworm machines [puppet] - https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) (owner: Dzahn)
[18:46:59] <wikibugs>	 (PS2) Dzahn: site: remove decom'ed people bookworm machines [puppet] - https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713)
[18:47:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:49:16] <wikibugs>	 (CR) Dzahn: [C:+2] site: remove decom'ed people bookworm machines [puppet] - https://gerrit.wikimedia.org/r/1203502 (https://phabricator.wikimedia.org/T408713) (owner: Dzahn)
[18:50:04] <wikibugs>	 (PS4) Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969)
[18:50:48] <wikibugs>	 (CR) Kamila Součková: "Adding to a couple more helmfiles done, let's see what CI thinks :D" [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[18:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:51:26] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:52:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:52:11] <wikibugs>	 (CR) Aaron Schulz: Change RESTbase => REST in wgRestSandboxSpecs names (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (owner: Aaron Schulz)
[18:54:18] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage
[18:56:26] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:57:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:58:36] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage
[19:01:42] <wikibugs>	 SRE, Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11360362 (ssingh) Thanks for filing this task @cmooney! The geofeed link above is very helpful. So it seems from the above (57.141.8.0/24, 57.141.8.0/24), we are missing the entries in the geo-maps...
[19:02:02] <jinxer-wm>	 RESOLVED: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:04:01] <wikibugs>	 (CR) CI reject: [V:-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[19:11:24] <logmsgbot>	 andrew@cumin2002 reimage (PID 2064112) is awaiting input
[19:11:57] <wikibugs>	 (PS4) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498)
[19:16:44] <wikibugs>	 (PS5) Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498)
[19:19:34] <wikibugs>	 (PS3) Arlolra: Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593)
[19:20:00] <wikibugs>	 (CR) Arlolra: Deploy Parsoid Read Views to 13 wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: Arlolra)
[19:20:48] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: Arlolra)
[19:25:57] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11360447 (Dzahn) Hello @Jdrewniak Do you really mean wikiPedia.org or wikiMedia.org? Just wanted to double check first because the config you link to is actually...
[19:32:59] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498)
[19:35:47] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2094.codfw.wmnet with OS bullseye
[19:35:55] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11360494 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye execute...
[19:36:25] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11360495 (Geagea) Now again VRT number - 17 digits  20251110103208628 20251110103208173
[19:44:40] <wikibugs>	 (PS2) CDanis: admin: deployment: add volker-e & rotate his ssh key [puppet] - https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243)
[19:46:24] <wikibugs>	 (CR) CDanis: [C:+2] "Confirmed via Slack DM" [puppet] - https://gerrit.wikimedia.org/r/1203488 (https://phabricator.wikimedia.org/T406243) (owner: CDanis)
[19:46:39] <wikibugs>	 (CR) Ssingh: [C:+1] base: add bat (batcat) to standard packages [puppet] - https://gerrit.wikimedia.org/r/1203462 (owner: CDanis)
[19:46:53] <wikibugs>	 (CR) CDanis: [C:+2] base: add bat (batcat) to standard packages [puppet] - https://gerrit.wikimedia.org/r/1203462 (owner: CDanis)
[19:47:56] <wikibugs>	 SRE, LDAP-Access-Requests, Research, Research-collaborations: Hourly pageview data request — Splitsville (2025) and related indie-film Wikipedia pages - https://phabricator.wikimedia.org/T409639#11360548 (A_smart_kitten) →Duplicate dup:T409676
[19:54:15] <tzatziki>	 !log removing 2 files for legal compliance
[19:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:15] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11360557 (CDanis) Open→Resolved a:CDanis
[19:58:25] <wikibugs>	 (CR) Muehlenhoff: "Unless it's also available on Buster this would break Puppet on the puppetmaster* nodes, though?" [puppet] - https://gerrit.wikimedia.org/r/1203462 (owner: CDanis)
[19:59:03] <tzatziki>	 !log removing 1 file for legal compliance
[19:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:01] <wikibugs>	 (PS1) CDanis: base: no batcat in <=buster [puppet] - https://gerrit.wikimedia.org/r/1203512
[20:04:03] <wikibugs>	 (CR) Ssingh: [C:+1] base: no batcat in <=buster [puppet] - https://gerrit.wikimedia.org/r/1203512 (owner: CDanis)
[20:05:11] <wikibugs>	 (CR) CDanis: [C:+2] base: no batcat in <=buster [puppet] - https://gerrit.wikimedia.org/r/1203512 (owner: CDanis)
[20:09:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: Pppery)
[20:21:23] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[20:25:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[20:27:11] <wikibugs>	 (PS5) Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969)
[20:37:11] <wikibugs>	 (PS1) Mstyles: OATHAuth: Increase 2FA opt-in to 60% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664)
[20:38:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[20:38:27] <wikibugs>	 (CR) CI reject: [V:-1] mw-web: Remove the hard-coded k8s version [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[20:38:38] <wikibugs>	 SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11360708 (Andrew) On @fgiunchedi's request I tried dd'ing every drive on a server before reimaging but grub still exhibits the issue.
[20:46:20] <wikibugs>	 (PS2) Mstyles: OATHAuth: Increase 2FA opt-in to 70% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664)
[20:50:09] <wikibugs>	 (CR) Kamila Součková: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[20:57:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199482 (owner: Jdlrobson)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T2100).
[21:00:04] <jouncebot>	 RoanKattouw, toyofuku, aude, arlolra, Pppery, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:17] <RoanKattouw>	 I'll go last :)
[21:01:44] <aude>	 with spiderpig, how do the backports work now?
[21:02:00] <aude>	 does a deploy do all the patches still?
[21:02:03] <aude>	 deployer
[21:02:27] <RoanKattouw>	 You can use Spiderpig yourself if you have access
[21:02:41] <aude>	 everyone does their own patch?
[21:02:43] <RoanKattouw>	 Otherwise the deployer will do it for you... and they'll just use Spiderpig themselves anyway
[21:02:56] <RoanKattouw>	 If they can, usually yes (not everyone has Spiderpig access)
[21:02:59] <aude>	 ok, deploying my patch
[21:03:12] <Pppery>	 Among other things, I'm here and I don't have spiderpig access
[21:03:50] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Use FancyCaptcha for API edits and page creations [mediawiki-config] - https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595)
[21:03:59] <RoanKattouw>	 Pppery: I'll do yours after aude is done
[21:05:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy2002 using scap backport" [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) (owner: Stoyofuku-wmf)
[21:07:16] <Pppery>	 I don't think it's possible to test mine without actually setting up Tor - do people want me to try to do that or are they willing to deploy without testing?
[21:09:06] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:09:07] <RoanKattouw>	 I'm happy to deploy that without fully testing it
[21:09:16] <Pppery>	 OK
[21:09:40] <Pppery>	 I'm scheduling this for deployment on behalf of the community, not because I have a personal stake in it
[21:09:40] <RoanKattouw>	 I would suggest getting someone who does use Tor to test it later though, to verify that the deploy did what you expected it to do
[21:09:58] <Pppery>	 Will do
[21:10:24] <RoanKattouw>	 But when Spiderpig asks me to test the change before it continues the deployment, I'm just going to check that the site still works and then hit continue
[21:12:04] <wikibugs>	 (PS1) CDanis: A modest proposal: run oomd on stat hosts [puppet] - https://gerrit.wikimedia.org/r/1203548
[21:12:14] <aude>	 is jenkins normally this slow?
[21:12:48] <wikibugs>	 (Merged) jenkins-bot: Use addModuleStyles for ReadingList icons [extensions/ReadingLists] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203139 (https://phabricator.wikimedia.org/T409116) (owner: Stoyofuku-wmf)
[21:13:06] <logmsgbot>	 !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]]
[21:13:10] <stashbot>	 T409116: Move ReadingList/Collections icon up in the loading module sequence - https://phabricator.wikimedia.org/T409116
[21:14:55] <wikibugs>	 (PS3) Jdlrobson: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470)
[21:15:03] <wikibugs>	 (CR) Jdlrobson: [C:+1] Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: Jdlrobson)
[21:15:10] <RoanKattouw>	 That was only 7 minutes, that's not that slow for an extension change. For config changes it's much faster, but for gated extensions it's slower
[21:15:16] <logmsgbot>	 !log aude@deploy2002 toyofuku, aude: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:16:57] <logmsgbot>	 !log aude@deploy2002 toyofuku, aude: Continuing with sync
[21:17:55] <logmsgbot>	 andrew@cumin2002 reimage (PID 2110470) is awaiting input
[21:21:22] <logmsgbot>	 !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203139|Use addModuleStyles for ReadingList icons (T409116)]] (duration: 08m 16s)
[21:21:26] <stashbot>	 T409116: Move ReadingList/Collections icon up in the loading module sequence - https://phabricator.wikimedia.org/T409116
[21:21:29] <aude>	 i'm done
[21:22:54] <wikibugs>	 (PS1) Scott French: Minor usability improvements for known-client objects [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1203550
[21:25:16] <wikibugs>	 (CR) Scott French: [V:+2] "Tested locally at `17556f9`" [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1203550 (owner: Scott French)
[21:25:33] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] Minor usability improvements for known-client objects [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1203550 (owner: Scott French)
[21:25:49] <arlolra>	 who is next?
[21:25:59] <Pppery>	 I think I am
[21:26:05] <RoanKattouw>	 Yes I'll do your patch now
[21:26:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: Pppery)
[21:27:01] <RoanKattouw>	 arlolra: After that, would you like to deploy your own patch, or would you like me to do it for you?
[21:27:07] <wikibugs>	 (Merged) jenkins-bot: Remove extended autoconfirmed time for Tor on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1200743 (https://phabricator.wikimedia.org/T409022) (owner: Pppery)
[21:27:13] <arlolra>	 I can take care of it, thanks
[21:27:16] <RoanKattouw>	 Great thanks
[21:27:17] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002"
[21:27:19] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002
[21:27:26] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]]
[21:27:28] <RoanKattouw>	 You can go after this one, and then we have Jon's patch, and then my patches
[21:27:29] <stashbot>	 T409022: Remove extended autoconfirmed time for tor users on enwiki - https://phabricator.wikimedia.org/T409022
[21:28:09] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002
[21:28:11] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Minor usability improvements for known-client objects - swfrench@cumin2002"
[21:28:23] <arlolra>	 Ok
[21:29:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T409638#11360925 (Jclark-ctr) Open→Resolved Idrac is showing  SYSTEM IS HEALTHY  after rebuilding.
[21:30:08] <logmsgbot>	 !log catrope@deploy2002 catrope, pppery: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:31:10] <logmsgbot>	 !log catrope@deploy2002 catrope, pppery: Continuing with sync
[21:32:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 457714272 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:32:32] <Pppery>	 I actually decided to fully test this anyway. I can confirm it works
[21:32:42] <RoanKattouw>	 Great, thank you!
[21:33:35] <Pppery>	 Getting Tor running was much smoother than I thought it would be, and I could exploit a bug/misfeature in TorBlock where it applies the enhanced autoconfirmed standards to every user as seen in their UserRights page, not just you, to avoid having to set up a test account
[21:35:45] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200743|Remove extended autoconfirmed time for Tor on enwiki (T409022)]] (duration: 08m 19s)
[21:35:49] <stashbot>	 T409022: Remove extended autoconfirmed time for tor users on enwiki - https://phabricator.wikimedia.org/T409022
[21:36:06] <RoanKattouw>	 arlolra: Your turn
[21:36:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: Arlolra)
[21:36:29] <RoanKattouw>	 Wow deployments have gotten a lot faster lately! (cc swfrench-wmf )
[21:36:59] <swfrench-wmf>	 :)
[21:37:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 16552 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:37:16] <wikibugs>	 (Merged) jenkins-bot: Deploy Parsoid Read Views to 13 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1203142 (https://phabricator.wikimedia.org/T409593) (owner: Arlolra)
[21:37:34] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]]
[21:37:38] <stashbot>	 T409593: Parsoid Read Views to deploy ~2025-11-10 - https://phabricator.wikimedia.org/T409593
[21:39:31] <swfrench-wmf>	 we've done a bit of tuning to make the prod deployment step a bit faster despite some of the awkwardness around the ongoing PHP migration.
[21:39:32] <swfrench-wmf>	 that said, a deployment that incurs a full image build (e.g., due to l10n updates), will still be rather slow
[21:39:44] <wikibugs>	 (PS1) BryanDavis: wikitech: Enable page protection indicators [mediawiki-config] - https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785)
[21:39:57] <RoanKattouw>	 Yeah I'll get to experience that in a little bit, I have an i18n change that I'm backporting (at the end of this window so as to not inconvenience others)
[21:40:16] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:40:20] <swfrench-wmf>	 good call scheduling that at the end!
[21:41:31] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Continuing with sync
[21:43:28] <bd808>	 wikitech doesn't have a history of on-wiki discussion for config changes, so I jumped right to a phab task (T409785) and gerrit patch for enabling the new protection indicators from core. Comment on either if you have an argument against turning this on.
[21:43:29] <stashbot>	 T409785: Enable protection indicators for wikitech - https://phabricator.wikimedia.org/T409785
[21:44:34] <Jdlrobson>	 @RoanKattouw are you using the security window after or is it okay if this deploy window goes over a little?
[21:44:42] <Jdlrobson>	 my config changes should be relatively quick and can go out together
[21:44:44] <wikibugs>	 (PS2) BryanDavis: wikitech: Enable page protection indicators [mediawiki-config] - https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785)
[21:44:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11361008 (Jclark-ctr) replaced failed drive bay 4.  idrac also  now has alert for  A predictive failure detected on drive 0 in disk...
[21:45:23] <RoanKattouw>	 Jdlrobson: I'll do your config changes first and then my time-consuming i18n change
[21:45:35] <RoanKattouw>	 That way I'm only inconveniencing myself with the security window
[21:45:48] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203142|Deploy Parsoid Read Views to 13 wikis (T409593)]] (duration: 08m 14s)
[21:45:52] <stashbot>	 T409593: Parsoid Read Views to deploy ~2025-11-10 - https://phabricator.wikimedia.org/T409593
[21:46:01] <arlolra>	 RoanKattouw: back to you
[21:46:26] <RoanKattouw>	 Jdlrobson: You said changes plural? Is there more than just  https://gerrit.wikimedia.org/r/c/1199482/ ?
[21:47:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199482 (owner: Jdlrobson)
[21:48:28] <wikibugs>	 (Merged) jenkins-bot: Update QuickSurvey platforms [mediawiki-config] - https://gerrit.wikimedia.org/r/1199482 (owner: Jdlrobson)
[21:48:46] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1199482|Update QuickSurvey platforms]]
[21:49:16] <wikibugs>	 (CR) BryanDavis: "I announced this in a couple of irc channels in case someone has a reason to oppose it. I kind of think we can be WP:BOLD and deploy whene" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: BryanDavis)
[21:50:42] <Jdlrobson>	 RoanKattouw: yeh i'd like to land https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1200173?usp=search as well if I can
[21:51:02] <Jdlrobson>	 (unused config code)
[21:51:05] <logmsgbot>	 !log catrope@deploy2002 catrope, jdlrobson: Backport for [[gerrit:1199482|Update QuickSurvey platforms]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:51:10] <RoanKattouw>	 OK I'll do that one next
[21:51:19] <wikibugs>	 (CR) Lucas Werkmeister: [C:+1] wikitech: Enable page protection indicators [mediawiki-config] - https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: BryanDavis)
[21:51:20] <Jdlrobson>	 sorry i missed you +2ed my change already
[21:51:24] <RoanKattouw>	 Jdlrobson: Could you test your QuickSurveys patch?
[21:51:29] <Jdlrobson>	 yep on it now
[21:53:24] <Jdlrobson>	 lgtm RoanKattouw 
[21:53:32] <logmsgbot>	 !log catrope@deploy2002 catrope, jdlrobson: Continuing with sync
[21:57:48] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199482|Update QuickSurvey platforms]] (duration: 09m 02s)
[21:59:57] <wikibugs>	 (PS2) BryanDavis: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/1201816
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251110T2200).
[22:00:05] <wikibugs>	 (CR) CI reject: [V:-1] wikitech: Put indicators in title with vector-2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/1201816 (owner: BryanDavis)
[22:03:58] <maryum>	 I have three security patches to deploy
[22:04:14] <RoanKattouw>	 Go ahead, I'll finish the rest of the backports after you're done
[22:05:32] <A_smart_kitten>	 maryum: just flagging the comment at https://phabricator.wikimedia.org/T407157#11361165 made a few mins ago
[22:05:33] <SomeRandomDev>	 maryum: please see T407157#11361165 in case you're planning to deploy that
[22:05:35] <SomeRandomDev>	 oh
[22:05:35] <wikibugs>	 (PS3) BryanDavis: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/1201816
[22:05:40] <A_smart_kitten>	 lol, great minds think alike
[22:06:28] <maryum>	 I was planning to deploy that SomeRandomDev A_smart_kitten
[22:07:01] <maryum>	 does that mean that patch can't go out since the core MR is still open?
[22:07:09] <SomeRandomDev>	 yes
[22:07:45] <maryum>	 okay I'll check back Thursday which is the next window
[22:07:53] <SomeRandomDev>	 alright, thanks
[22:08:22] <maryum>	 appreciate the heads up
[22:12:11] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11361187 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[22:14:06] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:33] <maryum>	 SomeRandomDev when I try to apply your alternative3 patch for T406664, it's not working. I'll leave a comment there
[22:17:53] <SomeRandomDev>	 it's not mine, but I can take a look
[22:18:27] <maryum>	 thanks
[22:18:38] <maryum>	 yep I just realized you commented on it but didn't write it
[22:21:40] <wikibugs>	 (PS3) Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[22:30:30] <wikibugs>	 (PS4) Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183)
[22:32:59] <maryum>	 preparing to run scap
[22:34:06] <wikibugs>	 (PS1) Scott French: deployment_server: fully migrate mw-(api-ext|web) to 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203559 (https://phabricator.wikimedia.org/T405955)
[22:36:57] <maryum>	 scap is running
[22:46:05] <maryum>	 scap is finished
[22:46:10] <maryum>	 !log Deployed fix for T406664 and T401053
[22:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:18] <RoanKattouw>	 maryum: Are you all done?
[22:50:24] <maryum>	 yes
[22:50:31] <RoanKattouw>	 Great, then I'll jump back in
[22:50:37] <maryum>	 enjoy
[22:51:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:55:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) (owner: Catrope)
[22:55:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [extensions/OATHAuth] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203126 (owner: Catrope)
[22:55:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[22:56:11] <Jdlrobson>	 RoanKattouw: are you still able to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1200173 ?
[22:56:16] <wikibugs>	 (Merged) jenkins-bot: OATHAuth: Increase 2FA opt-in to 70% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/1203535 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[22:56:17] <Jdlrobson>	 or can i do that quickly?
[22:56:20] <RoanKattouw>	 Yes I'll do that next
[22:56:23] <Jdlrobson>	 thx!
[22:56:55] <wikibugs>	 (CR) Btullis: [C:+2] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: Btullis)
[23:08:05] <wikibugs>	 (Merged) jenkins-bot: i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery [extensions/WikimediaMessages] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203097 (https://phabricator.wikimedia.org/T399749) (owner: Catrope)
[23:08:06] <wikibugs>	 (Merged) jenkins-bot: OATHManage: Don't always set the page title to "Create new recovery codes" [extensions/OATHAuth] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1203126 (owner: Catrope)
[23:08:28] <logmsgbot>	 !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]]
[23:08:33] <stashbot>	 T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749
[23:08:34] <stashbot>	 T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664
[23:10:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11361376 (BTullis) Open→Resolved It's all done now. Apologies for the delay in getting to this.
[23:16:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11361406 (BTullis) >>! In T408065#11361008, @Jclark-ctr wrote: > replaced failed drive bay 4.  idrac also  now has alert for  A pred...
[23:17:07] <icinga-wm>	 RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1203 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[23:34:11] <logmsgbot>	 !log catrope@deploy2002 catrope, mstyles: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:34:16] <stashbot>	 T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749
[23:34:16] <stashbot>	 T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664
[23:37:12] <logmsgbot>	 !log catrope@deploy2002 catrope, mstyles: Continuing with sync
[23:39:11] <jinxer-wm>	 FIRING: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[23:39:46] <logmsgbot>	 !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860
[23:39:50] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[23:55:52] <RoanKattouw>	 My deploy just failed after running for an hour :( https://spiderpig.wikimedia.org/jobs/887
[23:56:07] <RoanKattouw>	 `context deadline exceeded` well then
[23:56:12] <swfrench-wmf>	 RoanKattouw: that is very odd ... it looks like _only_ mw-wikifunctions timed out?
[23:56:20] <swfrench-wmf>	 and that triggered everything to roll back =/
[23:56:26] <RoanKattouw>	 Yeah it rolled back everything
[23:56:32] <swfrench-wmf>	 I'll take a quick look
[23:56:44] <swfrench-wmf>	 the good news is that retrying will be _much_ faster
[23:56:48] <RoanKattouw>	 Great
[23:56:55] <RoanKattouw>	 Would you like me to kick off that retry now?
[23:57:03] <RoanKattouw>	 Or would you like some time to take a look first?
[23:57:07] <swfrench-wmf>	 if it would be alright, give me a sec to see if I can sort out what happened
[23:57:59] <RoanKattouw>	 OK take your time