[00:00:07] (03PS1) 10Cwhite: scap: update beta-logs logstash host to use svc record [puppet] - 10https://gerrit.wikimedia.org/r/1208048 (https://phabricator.wikimedia.org/T409363) [00:04:12] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208022|ChangesListHooks: show entity titles in recent changes and watchlists (T406957)]] (duration: 10m 58s) [00:04:17] T406957: Show wish titles on lists (like on Wikidata) - https://phabricator.wikimedia.org/T406957 [00:04:26] okay, all done! [00:07:30] * bd808 looks at clock and decides to forge ahead [00:08:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201816 (owner: 10BryanDavis) [00:08:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: 10BryanDavis) [00:09:03] (03Merged) 10jenkins-bot: wikitech: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201816 (owner: 10BryanDavis) [00:09:06] (03Merged) 10jenkins-bot: wikitech: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203552 (https://phabricator.wikimedia.org/T409785) (owner: 10BryanDavis) [00:09:26] !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1201816|wikitech: Put indicators in title with vector-2022]], [[gerrit:1203552|wikitech: Enable page protection indicators (T409785)]] [00:09:30] T409785: Enable protection indicators for wikitech - https://phabricator.wikimedia.org/T409785 [00:13:47] !log bd808@deploy2002 bd808: Backport for [[gerrit:1201816|wikitech: Put indicators in title with vector-2022]], [[gerrit:1203552|wikitech: Enable page protection indicators (T409785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:15:15] !log bd808@deploy2002 bd808: Continuing with sync [00:19:15] !log bd808@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201816|wikitech: Put indicators in title with vector-2022]], [[gerrit:1203552|wikitech: Enable page protection indicators (T409785)]] (duration: 09m 50s) [00:19:16] (03CR) 10RLazarus: [C:03+2] mesh.configuration: Copy 1.15.0 to 1.15.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202880 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:19:21] T409785: Enable protection indicators for wikitech - https://phabricator.wikimedia.org/T409785 [00:20:51] (03Merged) 10jenkins-bot: mesh.configuration: Copy 1.15.0 to 1.15.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202880 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:21:15] (03PS5) 10RLazarus: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) [00:23:04] (03CR) 10RLazarus: [C:03+2] mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:24:41] (03Merged) 10jenkins-bot: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:28:01] Wikitech has page protection indicators now. (T409785) [00:28:02] T409785: Enable protection indicators for wikitech - https://phabricator.wikimedia.org/T409785 [00:38:41] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11394649 (10Jdrewniak) As Artem noted, this is a static HTML/CSS/JS site, and we don’t plan to update it after deployment unless absolutely necessary. @BCornwall... 
[00:39:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [00:39:17] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:40:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1208056 [00:40:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1208056 (owner: 10TrainBranchBot) [00:44:16] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [00:45:15] I would like to have sitenotice enabled for mobile devices visiting Wikitech. Feedback welcome at T410702, especially if you have reasoned arguments against that change. 
[00:45:15] T410702: Enable sitenotice on mobile for Wikitech - https://phabricator.wikimedia.org/T410702 [00:52:04] (03CR) 10RLazarus: [C:03+2] api-gateway: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203194 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:53:55] (03Merged) 10jenkins-bot: api-gateway: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203194 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [00:54:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [00:55:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1208056 (owner: 10TrainBranchBot) [01:00:54] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T410589)', diff saved to https://phabricator.wikimedia.org/P85428 and previous config saved to /var/cache/conftool/dbconfig/20251121-010138-ladsgroup.json [01:01:43] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:04:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:09:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container 
calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:09:17] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:10:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1208061 [01:10:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1208061 (owner: 10TrainBranchBot) [01:16:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P85429 and previous config saved to /var/cache/conftool/dbconfig/20251121-011646-ladsgroup.json [01:31:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P85430 and previous config saved to /var/cache/conftool/dbconfig/20251121-013153-ladsgroup.json [01:34:17] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:37:56] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1208061 (owner: 10TrainBranchBot) [01:44:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:47:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T410589)', diff saved to https://phabricator.wikimedia.org/P85431 and previous config saved to /var/cache/conftool/dbconfig/20251121-014701-ladsgroup.json [01:47:06] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:47:18] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [01:54:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:04:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:19:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container 
calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:24:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:26:01] FIRING: [2x] ProbeDown: Service wdqs1025:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:29:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:34:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:49:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:54:17] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:59:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:09:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:11:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... 
[03:11:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:12:42] o/ [03:12:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:12:50] !incidents [03:12:51] 7042 (UNACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [03:12:51] 7043 (UNACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [03:12:51] 7041 (RESOLVED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [03:12:51] 7036 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [03:13:00] !ack 7042 [03:13:00] 7042 (ACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [03:13:02] !ack 7043 [03:13:02] 7043 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [03:14:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:25] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:21:01] RESOLVED: [2x] ProbeDown: Service wdqs1025:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [03:21:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:22:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:24:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:29:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:37:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:37:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: ... [03:37:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:38:06] !incidents [03:38:06] 7044 (UNACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [03:38:06] 7045 (UNACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [03:38:06] 7043 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [03:38:07] 7042 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [03:38:07] 7041 (RESOLVED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [03:38:07] 7036 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [03:38:15] !ack 7044 [03:38:16] 7044 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [03:38:17] !ack 7045 [03:38:18] 7045 (ACKED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [03:52:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:52:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - 
cr1-eqiad:xe-3/3/2 (Transit: ... [03:52:51] Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [04:09:17] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [04:14:16] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [04:18:40] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [04:18:45] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [04:39:27] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:04:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:09:36] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:29:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:32:44] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:59:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:31:20] (03PS1) 10MusikAnimal: [mediawikiwiki] Enable CommunityRequests with translations only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208180 (https://phabricator.wikimedia.org/T405694) [06:35:45] (03PS2) 10MusikAnimal: [mediawikiwiki] Enable CommunityRequests with translations only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208180 (https://phabricator.wikimedia.org/T405694) [06:49:08] (03PS1) 10Kevin Bazira: httpbb: add post deployment tests for the revertrisk-wikidata endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1208189 (https://phabricator.wikimedia.org/T406179) [06:54:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:54:53] !incidents [06:54:54] 7045 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [06:54:54] 
7044 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [06:54:54] 7043 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [06:54:54] 7042 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [06:54:55] 7041 (RESOLVED) InboundMXQueueHigh sre (mx-in1001:9154 eqiad) [06:54:55] 7036 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T0700) [07:12:52] (03CR) 10Ayounsi: [C:03+1] hiera: lvs/interfaces: remove VLAN sub-ints for edges [puppet] - 10https://gerrit.wikimedia.org/r/1207180 (https://phabricator.wikimedia.org/T410411) (owner: 10Ssingh) [07:21:28] (03PS6) 10Arnaudb: gerrit: add a local backup cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) [07:25:34] (03PS1) 10Muehlenhoff: hcaptcha: Enabled routed-compatible Bird in esams/magru [puppet] - 10https://gerrit.wikimedia.org/r/1208212 (https://phabricator.wikimedia.org/T409860) [07:26:03] (03CR) 10CI reject: [V:04-1] hcaptcha: Enabled routed-compatible Bird in esams/magru [puppet] - 10https://gerrit.wikimedia.org/r/1208212 (https://phabricator.wikimedia.org/T409860) (owner: 10Muehlenhoff) [07:27:46] (03PS2) 10Muehlenhoff: hcaptcha: Enabled routed-compatible Bird in esams/magru [puppet] - 10https://gerrit.wikimedia.org/r/1208212 (https://phabricator.wikimedia.org/T409860) [07:28:43] (03PS1) 10Bartosz Wójtowicz: ml-services: Add missing WIKI_URL env variable to Revise Tone model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208217 (https://phabricator.wikimedia.org/T408538) [07:32:26] (03CR) 10Muehlenhoff: [C:03+1] "Thus pulls in three additional fonts, but all within reason" [puppet] - 10https://gerrit.wikimedia.org/r/1208012 (owner: 10Alexandros Kosiaris) [07:34:09] (03Abandoned) 10Muehlenhoff: hcaptcha: Enabled routed-compatible Bird in esams/magru [puppet] - 10https://gerrit.wikimedia.org/r/1208212 (https://phabricator.wikimedia.org/T409860) (owner: 10Muehlenhoff) [07:37:12] (03PS2) 10Bartosz Wójtowicz: ml-services: Enable Changeprop for revise-tone-task-generator staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) [07:38:53] (03CR) 10Bartosz Wójtowicz: ml-services: Enable Changeprop for revise-tone-task-generator staging. (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [07:41:03] (03PS1) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) [07:41:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [07:44:11] (03CR) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [07:46:14] (03CR) 10Dpogorzelski: [C:03+1] ml-services: Add missing WIKI_URL env variable to Revise Tone model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208217 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [07:47:21] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Add missing WIKI_URL env variable to Revise Tone model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208217 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [07:49:08] (03Merged) 10jenkins-bot: ml-services: Add missing WIKI_URL env variable to Revise Tone model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208217 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [07:50:55] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:51:27] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [07:58:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T0800) [08:07:32] (03CR) 10Slyngshede: [C:03+1] "Key verified out of band." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207863 (owner: 10Marostegui) [08:09:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:10:40] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913 (owner: 10Majavah) [08:10:43] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloudgw: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1207912 (owner: 10Majavah) [08:14:27] FIRING: [2x] CalicoHighMemoryUsage: Calico container calico-node-pvjjr:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [08:17:03] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Do not pass callback arguments to incompatible method [extensions/GlobalPreferences] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207950 (https://phabricator.wikimedia.org/T410551) (owner: 10Brennen Bearnes) [08:17:19] (03CR) 10AikoChou: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [08:22:58] (03PS2) 10Itamar Givon: Replace 'let' with arithmetic expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) [08:23:23] (03CR) 10Itamar Givon: Replace 'let' with arithmetic expansion (032 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [08:25:02] (03PS2) 10Itamar Givon: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) [08:25:12] (03PS2) 10Itamar Givon: Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) [08:25:38] 06SRE, 06cloud-services-team, 07Upstream: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11394895 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm calling this one done since we have a workaround in place; I'll followup on the Deb... 
[08:26:24] (03CR) 10Filippo Giunchedi: "Will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/1207743 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:26:38] (03CR) 10Filippo Giunchedi: "Will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/1207742 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:26:46] (03CR) 10Filippo Giunchedi: "Will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/1207741 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:26:54] (03CR) 10Filippo Giunchedi: "Will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/1207740 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:27:02] (03CR) 10Filippo Giunchedi: "Will merge next week" [puppet] - 10https://gerrit.wikimedia.org/r/1207739 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [08:32:43] (03CR) 10Gehel: [C:03+1] A modest proposal: run oomd on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203548 (owner: 10CDanis) [08:36:15] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cumin1002.eqiad.wmnet [08:37:02] (03PS4) 10Arnaudb: apt: add an alert on reprepro errors [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) [08:37:02] (03CR) 10Arnaudb: "done!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:39:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [08:39:27] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:41:02] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11394924 (10MoritzMuehlenhoff) [08:41:03] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11394925 (10TheDJ) >>! In T408592#11394649, @Jdrewniak wrote: > User interaction is essentially limited to loading and scrolling, so I don’t see any meaningful sec... 
[08:41:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:41:27] (03PS3) 10Itamar Givon: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) [08:41:27] (03PS3) 10Itamar Givon: Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) [08:42:34] (03CR) 10Itamar Givon: Clean up existing symlink before creating a new one (032 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [08:43:18] (03PS4) 10Itamar Givon: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) [08:43:27] (03PS4) 10Itamar Givon: Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) [08:44:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [08:44:58] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@f3216ec] (releasing): testing issue with instance [08:46:47] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@f3216ec] (releasing): testing issue with instance (duration: 01m 48s) [08:46:57] jmm@cumin2002 decommission (PID 1672472) is awaiting input [09:02:09] 06SRE, 06Infrastructure-Foundations, 10netops: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11394993 (10ayounsi) management routers are physically single homed in the old design (eqsin, codfw, eqiad), probably because it was best to not over engineer 
it, and mgmt net... [09:09:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:13:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cumin1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:16:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cumin1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:16:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:16:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cumin1002.eqiad.wmnet [09:19:10] (03PS1) 10Bartosz Wójtowicz: ml-services: Update image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208280 (https://phabricator.wikimedia.org/T408538) [09:21:59] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice-archive: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11395040 (10Guycn2) This seems to be blocking legitimate web captures by the Internet Archive (see, for example, [[ https://web.archive.org/web... [09:26:28] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [09:27:16] (03CR) 10Jcrespo: [C:03+2] garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [09:28:08] (03CR) 10AikoChou: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208280 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:28:43] (03CR) 10Muehlenhoff: "I'm not sure this is actually beneficial? If people run into actual bugs, allowing them to send the stacktrace is useful. And the perceive" [puppet] - 10https://gerrit.wikimedia.org/r/1207874 (owner: 10Slyngshede) [09:29:25] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208280 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:30:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11395055 (10MoritzMuehlenhoff) >>! In T410612#11393300, @Clement_Goubert wrote: > @KOfori Could you approve this ? This needs approval by @mark or @Kappakayala (see the approval:... [09:31:16] (03Merged) 10jenkins-bot: ml-services: Update image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208280 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:31:27] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Enable Changeprop for revise-tone-task-generator staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:33:04] (03PS3) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [09:33:19] (03Merged) 10jenkins-bot: ml-services: Enable Changeprop for revise-tone-task-generator staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1207829 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:34:14] (03CR) 10CI reject: [V:04-1] Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [09:34:17] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [09:37:36] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:38:38] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11395092 (10ayounsi) New updated list, we're at 642 hosts if the MySQL query from the initial task is still the proper way. Up from 90 in 2020, probably becaus... [09:39:19] (03PS4) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [09:40:30] (03CR) 10CI reject: [V:04-1] Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [09:43:46] I'm clearly not understanding what I'm doing (^). It looks like there are weekly Observability office hours. Could you invite me? 
https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Observability/Contact [09:44:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:37] !log dpogorzelski@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [09:45:45] !log dpogorzelski@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:50:49] (03CR) 10MSantos: [C:03+1] admin: transfer group approver for releasers-mediawiki to Mateus Santos [puppet] - 10https://gerrit.wikimedia.org/r/1207304 (owner: 10Dzahn) [09:59:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:04:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:05:32] (03CR) 10Majavah: [C:03+2] P:wmcs::cloudgw: Cleanup parameter types [puppet] - 10https://gerrit.wikimedia.org/r/1207912 (owner: 10Majavah) [10:05:39] (03CR) 10Majavah: [C:03+2] P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913 (owner: 10Majavah) [10:06:35] (03PS2) 10Majavah: P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913 [10:09:25] (03CR) 
10Majavah: [C:03+2] P:wmcs::cloudgw: Cleanup natlog feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1207913 (owner: 10Majavah) [10:16:52] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@a809ec3] (releasing): T410680 [10:16:57] T410680: scap publish-docs job failing - https://phabricator.wikimedia.org/T410680 [10:17:08] (03PS1) 10Majavah: hieradata: cloudgw: Fix inconsistent use of network masks [puppet] - 10https://gerrit.wikimedia.org/r/1208290 [10:18:15] (03CR) 10Majavah: [C:03+2] hieradata: cloudgw: Fix inconsistent use of network masks [puppet] - 10https://gerrit.wikimedia.org/r/1208290 (owner: 10Majavah) [10:19:06] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@a809ec3] (releasing): T410680 (duration: 02m 13s) [10:20:44] (03PS5) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:21:05] (03PS1) 10Ayounsi: Network report: Remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208291 (https://phabricator.wikimedia.org/T253173) [10:21:07] (03PS1) 10Anzx: Revert "tcywikisource: throttle.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 [10:22:12] (03PS6) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:24:13] (03PS2) 10Anzx: Revert "tcywikisource: throttle exception" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) [10:25:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [10:26:20] (03PS3) 10Blake: Add a node_file_age to compare to broker process uptime. 
[puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) [10:27:33] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11395183 (10MatthewVernon) Thanks! I've spent a fair chunk of time searching and have come up with nothing. My next stop is likely #no-stu... [10:28:13] (03PS5) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [10:29:24] (03CR) 10CI reject: [V:04-1] Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [10:29:53] (03PS1) 10Majavah: interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 [10:30:15] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11395193 (10MatthewVernon) I have solved the easy one, though: ecosia. If you image search on there (e.g. https://www.ecosia.org/images?q=... 
[10:31:20] (03PS7) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:32:07] (03PS6) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [10:32:29] (03CR) 10CI reject: [V:04-1] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [10:33:34] (03CR) 10Lucas Werkmeister (WMDE): Revert "tcywikisource: throttle exception" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [10:33:41] (03PS7) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [10:35:05] (03PS3) 10Anzx: Revert "tcywikisource: throttle exception" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) [10:35:35] (03PS8) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:36:29] (03PS9) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:36:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [10:37:20] (03CR) 10Anzx: Revert "tcywikisource: throttle exception" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [10:41:32] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy on Monday as scheduled :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208292 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [10:42:18] !log ayounsi@cumin1003 START - 
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1005.eqiad.wmnet [10:43:37] (03PS2) 10Majavah: interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 [10:43:37] (03PS1) 10Majavah: network: Add type to validate VLAN tags [puppet] - 10https://gerrit.wikimedia.org/r/1208297 [10:43:37] (03PS1) 10Majavah: P:openstack: Handle VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208298 [10:43:38] (03PS1) 10Majavah: hieradata: lvs: Store VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208299 [10:45:31] ayounsi@cumin1003 upgrade-firmware (PID 4061773) is awaiting input [10:46:31] (03PS10) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:46:45] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [10:47:10] (03CR) 10CI reject: [V:04-1] P:openstack: Handle VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208298 (owner: 10Majavah) [10:47:57] (03CR) 10CI reject: [V:04-1] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [10:54:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [10:54:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops for blake - https://phabricator.wikimedia.org/T410612#11395246 (10Volans) @MoritzMuehlenhoff yes 
but I think Kwaku is covering for Kavitha while she's away in the next few days. [10:56:31] (03PS11) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [10:56:42] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [10:56:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 26 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7670" [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [11:04:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:04:21] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:04:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:04:57] (03PS8) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [11:05:22] (03CR) 10Gkyziridis: [C:03+1] "I am not familiar with this repo, although it looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1208189 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [11:05:42] (03PS9) 10Gehel: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) [11:07:25] (03CR) 10Muehlenhoff: [C:03+1] "I can't speak for the rest, but I can confirm that maps and Ganeti are fully migrated to proper 4+6." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208291 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [11:09:57] (03PS12) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [11:11:33] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:12:26] (03CR) 10Brouberol: [C:03+1] "Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [11:13:46] 06SRE, 06Infrastructure-Foundations, 10netops: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11395278 (10cmooney) >>! In T407488#11394993, @ayounsi wrote: > management routers are physically single homed in the old design (eqsin, codfw, eqiad), probably because it was... [11:15:59] (03CR) 10JMeybohm: [C:03+2] P:conftool::requestctl_client: update requestctl_cli.original.py [puppet] - 10https://gerrit.wikimedia.org/r/1207844 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [11:18:46] (03PS13) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) [11:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152299 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:21:40] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:23:09] (03CR) 10Klausman: [C:03+2] httpbb: add post deployment tests for the revertrisk-wikidata endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/1208189 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [11:24:43] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:24:56] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice-archive: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11395297 (10daniel) >>! In T400119#11395040, @Guycn2 wrote: > This seems to be blocking legitimate web captures by the Internet Archive (see, f... [11:26:16] (03PS2) 10Majavah: network: Add type to validate VLAN tags [puppet] - 10https://gerrit.wikimedia.org/r/1208297 [11:26:16] (03PS2) 10Majavah: P:openstack: Handle VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208298 [11:26:16] (03PS2) 10Majavah: hieradata: lvs: Store VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208299 [11:26:17] (03PS3) 10Majavah: interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 [11:26:18] (03PS1) 10Majavah: P:cloudceph::osd: Set storage VLAN mapping in tests [puppet] - 10https://gerrit.wikimedia.org/r/1208305 [11:26:19] (03PS1) 10Majavah: P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1208306 [11:26:23] (03PS1) 10Majavah: interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 [11:26:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add connection from - https://phabricator.wikimedia.org/T410717 (10cmooney) 03NEW p:05Triage→03Low [11:28:16] (03PS14) 10Jcrespo: garage: Add a first role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1207887 
(https://phabricator.wikimedia.org/T410020) [11:29:31] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:36:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7671/console" [puppet] - 10https://gerrit.wikimedia.org/r/1208298 (owner: 10Majavah) [11:38:27] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7674/console" [puppet] - 10https://gerrit.wikimedia.org/r/1208299 (owner: 10Majavah) [11:38:45] (03CR) 10Majavah: hieradata: lvs: Store VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208299 (owner: 10Majavah) [11:40:07] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:44:15] 06SRE, 06Infrastructure-Foundations, 10netops: mr1-codfw is single-homed to lsw1-a2-codfw - https://phabricator.wikimedia.org/T407488#11395392 (10cmooney) 05Open→03Resolved a:03cmooney Will open task on getting the second link in place on mr1-codfw. We can look to do the same in eqiad once the row... [11:44:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:45:45] (03CR) 10Majavah: [C:03+2] toolforge: Add redis-tools to bastions [puppet] - 10https://gerrit.wikimedia.org/r/1208023 (https://phabricator.wikimedia.org/T410102) (owner: 10BryanDavis) [11:46:54] (03CR) 10Jcrespo: "Is lookup('hash.key') puppet >= 6 compatible only? 
is this acceptable?" [puppet] - 10https://gerrit.wikimedia.org/r/1207887 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [11:48:27] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 26 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7673" [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [11:49:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:49:32] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7675/console" [puppet] - 10https://gerrit.wikimedia.org/r/1208306 (owner: 10Majavah) [11:51:30] (03PS1) 10Bartosz Wójtowicz: ml-services: Update the image for revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) [12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T0800) [12:00:05] jelto, arnoldokoth, and mutante: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251121T1200) [12:00:54] (03CR) 10FNegri: [C:03+1] network: Add type to validate VLAN tags [puppet] - 10https://gerrit.wikimedia.org/r/1208297 (owner: 10Majavah) [12:02:24] (03CR) 10FNegri: [C:03+1] P:openstack: Handle VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208298 (owner: 10Majavah) [12:04:30] (03CR) 10AikoChou: [C:03+1] ml-services: Update the image for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208310 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:05:12] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 38): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7676/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [12:07:06] (03CR) 10FNegri: [C:03+1] P:cloudceph::osd: Set storage VLAN mapping in tests [puppet] - 10https://gerrit.wikimedia.org/r/1208305 (owner: 10Majavah) [12:07:29] (03CR) 10Majavah: [C:03+2] network: Add type to validate VLAN tags [puppet] - 10https://gerrit.wikimedia.org/r/1208297 (owner: 10Majavah) [12:07:38] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: Handle VLAN tags as numbers [puppet] - 10https://gerrit.wikimedia.org/r/1208298 (owner: 10Majavah) [12:07:47] (03CR) 10Majavah: [C:03+2] P:cloudceph::osd: Set storage VLAN mapping in tests [puppet] - 10https://gerrit.wikimedia.org/r/1208305 (owner: 10Majavah) [12:08:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11395518 (10cmooney) [12:10:57] (03CR) 10Cathal Mooney: [C:03+2] Machine Learning beast servers: allow BGP to alternate rack [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1204072 (owner: 10Cathal Mooney) [12:11:45] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:12:21] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:01] (03PS2) 10Ssingh: plugins/wmf-netbox: add hcaptcha-proxy VMs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1207915 (https://phabricator.wikimedia.org/T409780) [12:18:27] (03CR) 10Cathal Mooney: [C:03+2] plugins/wmf-netbox: add hcaptcha-proxy VMs 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1207915 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [12:19:43] RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:22:51] (03PS1) 10Btullis: Add a new spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [12:22:53] (03PS1) 10Btullis: Enable the spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [12:22:55] (03PS1) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [12:24:55] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Homer release v0.11.0 minor update - cmooney@cumin1003 [12:26:16] (03PS1) 10Esanders: Enable DiscussionTools visual enhancements on ruwiki & svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) [12:26:32] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Homer release v0.11.0 minor update - cmooney@cumin1003 [12:26:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208320 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [12:29:37] (03PS1) 10Btullis: Add k8s 
tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) [12:31:29] (03PS2) 10Btullis: Add a new spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [12:31:29] (03PS2) 10Btullis: Enable the spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [12:31:29] (03PS2) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [12:35:16] (03PS3) 10Btullis: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [12:35:16] (03PS3) 10Btullis: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [12:35:16] (03PS3) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [12:39:27] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:42:39] (03PS2) 10Btullis: Add k8s tokens for the analytics namespace [puppet] - 10https://gerrit.wikimedia.org/r/1208321 (https://phabricator.wikimedia.org/T410017) [12:45:09] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7678/co" [puppet] - 10https://gerrit.wikimedia.org/r/1208321 
(https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:45:19] (03PS4) 10Btullis: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [12:45:19] (03PS4) 10Btullis: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [12:45:19] (03PS4) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [12:52:07] (03CR) 10CI reject: [V:04-1] Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:52:12] (03CR) 10CI reject: [V:04-1] Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:52:35] (03CR) 10Gehel: [C:03+2] Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [12:52:49] (03CR) 10CI reject: [V:04-1] Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:53:50] (03Merged) 10jenkins-bot: Hive: alert when query rate is too high [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [12:59:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:09:27] FIRING: [2x] 
PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:57] (03PS1) 10Dragoniez: rowiki: Redefine AbuseFilter permission model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) [13:12:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1007.eqiad.wmnet [13:14:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:14:18] (03PS2) 10D3r1ck01: tests: Make data providers static methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208328 (https://phabricator.wikimedia.org/T410731) [13:18:39] (03CR) 10Dragoniez: "Please double-check the setup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208329 (https://phabricator.wikimedia.org/T407978) (owner: 10Dragoniez) [13:19:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1007.eqiad.wmnet [13:24:12] (03PS5) 10Btullis: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [13:24:12] (03PS5) 10Btullis: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [13:24:12] (03PS5) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [13:24:27] !log mwpresync@deploy2002 Started scap build-images: Publishing 
wmf/next image [13:26:44] (03PS1) 10Muehlenhoff: Remove access for west1 [puppet] - 10https://gerrit.wikimedia.org/r/1208332 [13:27:25] (03CR) 10CI reject: [V:04-1] Remove access for west1 [puppet] - 10https://gerrit.wikimedia.org/r/1208332 (owner: 10Muehlenhoff) [13:29:53] (03PS6) 10Btullis: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [13:29:53] (03PS6) 10Btullis: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [13:29:53] (03PS6) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [13:30:25] (03CR) 10Urbanecm: [C:04-2] "This seems to strictly depend on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1206927 (specifically, it requi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) (owner: 10Cyndywikime) [13:35:51] (03PS2) 10Muehlenhoff: Remove access for west1 [puppet] - 10https://gerrit.wikimedia.org/r/1208332 [13:36:16] (03PS1) 10Dragoniez: jawiki: Disallow sysops from granting temporary-account-viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) [13:39:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [13:42:57] (03CR) 10Slyngshede: [C:03+1] Remove access for west1 [puppet] - 10https://gerrit.wikimedia.org/r/1208332 (owner: 10Muehlenhoff) [13:44:40] FIRING: SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:49] (03CR) 10Dragoniez: jawiki: Disallow sysops from granting temporary-account-viewer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208334 (https://phabricator.wikimedia.org/T409687) (owner: 10Dragoniez) [13:48:48] (03CR) 10Ayounsi: [C:03+2] Network report: Remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208291 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [13:51:02] (03Merged) 10jenkins-bot: Network report: Remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208291 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [13:52:16] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [13:52:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [13:56:15] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11395909 (10MatthewVernon) [14:03:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool pc6 (T405942)', diff saved to https://phabricator.wikimedia.org/P85435 and previous config saved to /var/cache/conftool/dbconfig/20251121-140327-ladsgroup.json [14:03:32] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [14:05:10] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on pc1016.eqiad.wmnet with reason: Maint [14:05:22] (03PS1) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes 
[puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:05:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on pc2016.codfw.wmnet with reason: Maint [14:05:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:08:18] (03PS2) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:08:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:08:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11395967 (10Jclark-ctr) pc1016 has been moved with @Ladsgroup Thanks for your help this morning [14:09:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool pc6 (T405942)', diff saved to https://phabricator.wikimedia.org/P85436 and previous config saved to /var/cache/conftool/dbconfig/20251121-140903-ladsgroup.json [14:09:09] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [14:09:17] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11395970 (10MatthewVernon) [14:13:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool pc7 (T405942)', diff saved to https://phabricator.wikimedia.org/P85437 and previous config saved to /var/cache/conftool/dbconfig/20251121-141345-ladsgroup.json [14:13:53] !log homer "cr*eqiad*" commit "bring up hcaptcha-proxy100[12]": T409780 [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:17] 
FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:14:26] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on pc1017.eqiad.wmnet with reason: Maint [14:14:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on pc2017.codfw.wmnet with reason: Maint [14:15:55] (03CR) 10Mszwarc: Enable v2 non-emergency workflow by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [14:16:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11396000 (10Ladsgroup) Since the depool time was quite short, the latency immediately recovered so we are moving forward to pc7 and pc8 too. 
[14:17:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool pc7 (T405942)', diff saved to https://phabricator.wikimedia.org/P85438 and previous config saved to /var/cache/conftool/dbconfig/20251121-141747-ladsgroup.json [14:17:52] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [14:18:24] !log homer "cr*codfw*" commit "bring up hcaptcha-proxy200[12]": T409780 [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:48] (03CR) 10Bking: [C:03+2] A modest proposal: run oomd on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203548 (owner: 10CDanis) [14:20:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11396006 (10ayounsi) The current cable is already to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables/7147/) so probably a3 is the next one. To... [14:21:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool pc8 (T405942)', diff saved to https://phabricator.wikimedia.org/P85439 and previous config saved to /var/cache/conftool/dbconfig/20251121-142059-ladsgroup.json [14:21:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2018.codfw.wmnet with reason: Maint [14:21:38] (03CR) 10Muehlenhoff: [C:03+2] Remove access for west1 [puppet] - 10https://gerrit.wikimedia.org/r/1208332 (owner: 10Muehlenhoff) [14:21:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1018.eqiad.wmnet with reason: Maint [14:22:16] (03PS3) 10Michael Große: testwiki: enable ReviseTone experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) [14:22:16] (03CR) 10Michael Große: [C:04-1] "Still do not merge, the data is not there yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207886 (https://phabricator.wikimedia.org/T407029) (owner: 10Michael Große) [14:22:32] (03PS1) 10Michael Große: Growth: Enable Revise Tone feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) [14:22:32] (03CR) 10Michael Große: [C:04-1] "Still do not merge, the data is not there yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [14:23:11] !log homer "cr*ulsfo*" commit "bring up hcaptcha-proxy400[12]": T409780 [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:45] (03PS1) 10Andrew Bogott: idp_clouddev: duplicate private settings from cloudweb2002-dev [labs/private] - 10https://gerrit.wikimedia.org/r/1208359 (https://phabricator.wikimedia.org/T410294) [14:24:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:24:42] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm [14:24:51] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] idp_clouddev: duplicate private settings from cloudweb2002-dev [labs/private] - 10https://gerrit.wikimedia.org/r/1208359 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:25:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool pc8 (T405942)', diff saved to https://phabricator.wikimedia.org/P85440 and previous config saved to /var/cache/conftool/dbconfig/20251121-142500-ladsgroup.json [14:25:06] T405942: eqiad row C/D Data Persistence host migrations - 
https://phabricator.wikimedia.org/T405942 [14:25:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11396019 (10cmooney) >>! In T410717#11396006, @ayounsi wrote: > The current cable is already to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables... [14:25:27] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm [14:25:28] (03PS3) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:25:30] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:25:39] !log homer "cr*eqsin*" commit "bring up hcaptcha-proxy500[12]": T409780 [14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:14] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging West1 out of all services on: 2410 hosts [14:26:58] sukhe: [14:27:04] topranks: [14:27:04] https://www.irccloud.com/pastebin/3paJ5myl/ [14:27:05] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [14:27:48] topranks: looks good! [14:27:57] IPv6 is down though [14:28:02] topranks: we never set that up [14:28:15] the anycast v6 setup is really well [14:28:19] we only do that in durum [14:28:20] thoughts? [14:28:28] well you set it up on the router :) [14:28:33] is really not well tested [14:28:43] topranks: ah so I think we should have set only ipv4 in the change? 
[14:28:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11396028 (10Ladsgroup) https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&from=2025-11-21T13:37:29.200Z&to=2025-11-21T14:28:00.3... [14:29:07] 'hcaptcha-proxy': {'group': 'anycast'}, [14:29:11] should really be ipv4_only: True? [14:29:22] that should take care of it? [14:29:32] or we can also do v6, we already do for durum so [14:29:39] I was about to say I've no idea how we control that [14:29:43] but yes I think you found it [14:29:51] ok, I will patch again [14:29:51] better to do v6 as well yes [14:29:54] ok [14:29:57] will do after this meeting [14:30:00] depends how tricky either is [14:30:01] no probs [14:30:03] (03PS4) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:30:12] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:30:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm [14:30:41] (03PS1) 10Ayounsi: Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 [14:32:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11396048 (10Jclark-ctr) 05Open→03Resolved a:05Marostegui→03Jclark-ctr all hosts listed on this task have been migrated. [14:32:29] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11396055 (10taavi) Noting here that I filed {T410465} about this codebase. 
[14:32:35] (03CR) 10Muehlenhoff: cloudidp-dev: Hiera changes to make more like normal idp nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:34:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (the role itself has been adapted to proper IPV6 (initially the test systems were named maps-test, which didn't match that filt" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [14:34:44] bking@cumin2002 reimage (PID 1841351) is awaiting input [14:35:45] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1008.eqiad.wmnet with OS bookworm [14:36:12] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11396060 (10Ladsgroup) I haven't found anything in gadgets, etc. https://commons.wikimedia.org/w/index.php?title=Special:Search&limit=500&... 
[14:36:20] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [14:36:45] (03PS5) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:37:12] (03CR) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:37:48] (03PS6) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:38:02] (03CR) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:38:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:39:47] bking@cumin2002 reimage (PID 1850645) is awaiting input [14:40:59] PROBLEM - MD RAID on ganeti1039 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:41:01] ACKNOWLEDGEMENT - MD RAID on ganeti1039 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410743 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:41:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743 (10ops-monitoring-bot) 03NEW [14:41:47] 
(03PS7) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [14:41:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [14:41:59] (03PS1) 10Muehlenhoff: cas: Remove idp::memcached [puppet] - 10https://gerrit.wikimedia.org/r/1208362 [14:42:31] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance [14:42:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T410589)', diff saved to https://phabricator.wikimedia.org/P85441 and previous config saved to /var/cache/conftool/dbconfig/20251121-144238-ladsgroup.json [14:42:44] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:48:24] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1009.eqiad.wmnet with OS bookworm [14:48:57] !log homer "asw*drmrs*" commit "bring up hcaptcha-proxy600[12]": T409780 [14:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208362 (owner: 10Muehlenhoff) [14:52:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1207791 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [14:52:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11396143 (10cmooney) Just to update this we are not planning to move the IP gateways for vlans to the Nokia switches in t... 
[14:52:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - https://phabricator.wikimedia.org/T405632#11396146 (10cmooney) Just to update this we are not planning to move the IP gateways for vlans to the Nok... [14:54:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:30] (03CR) 10Brouberol: Add a new deploy-spark-support clusterrole (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:55:42] (03CR) 10Brouberol: [C:03+1] Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:55:50] (03CR) 10Brouberol: [C:03+1] Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:58:22] (03Abandoned) 10Alexandros Kosiaris: WIP: Move monitoring stanzas to shared templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/647660 (owner: 10Alexandros Kosiaris) [14:58:22] (03Abandoned) 10Alexandros Kosiaris: Revert "Remove the portforward right from deploy role" [deployment-charts] - 10https://gerrit.wikimedia.org/r/544157 (https://phabricator.wikimedia.org/T235821) (owner: 10Alexandros Kosiaris) [14:58:22] (03Abandoned) 10Alexandros Kosiaris: coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 (owner: 10Alexandros Kosiaris) [14:58:22] (03Abandoned) 10Alexandros Kosiaris: Experiment with a jenkins-debug user [deployment-charts] - 
10https://gerrit.wikimedia.org/r/722869 (https://phabricator.wikimedia.org/T290360) (owner: 10Alexandros Kosiaris) [14:58:25] (03Abandoned) 10Alexandros Kosiaris: ATS: Switch blubberoid to discovery records [puppet] - 10https://gerrit.wikimedia.org/r/538162 (owner: 10Alexandros Kosiaris) [14:58:29] (03Abandoned) 10Alexandros Kosiaris: Add reprepo updates for cassandra311 [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [14:59:09] (03PS1) 10Tiziano Fogli: Blackbox/check: strengthen suffix matching regex in generated rules [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) [15:04:00] (03PS8) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [15:04:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:04:26] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [15:06:20] (03PS4) 10Blake: Add a node_file_age to compare to broker process uptime. 
[puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) [15:08:37] (03CR) 10Blake: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207879 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [15:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:43] (03PS9) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [15:09:43] (03PS1) 10Andrew Bogott: profile::idp: require many args to be non-empty [puppet] - 10https://gerrit.wikimedia.org/r/1208370 [15:09:57] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: split targets into directories by source [puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite) [15:11:11] (03CR) 10Btullis: Add a new deploy-spark-support clusterrole (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [15:11:13] (03PS2) 10Andrew Bogott: profile::idp: require many args to be non-empty [puppet] - 10https://gerrit.wikimedia.org/r/1208370 [15:11:13] (03PS10) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [15:11:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208370 (owner: 10Andrew Bogott) [15:12:03] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [15:14:17] FIRING: [3x] CalicoHighMemoryUsage: Calico 
container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:16:11] (03CR) 10Alexandros Kosiaris: [C:03+2] "Indeed it does. And it might make sense to ship some nerdfont from https://www.nerdfonts.com/ at some point, but that's for latter." [puppet] - 10https://gerrit.wikimedia.org/r/1208012 (owner: 10Alexandros Kosiaris) [15:16:45] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208013 (owner: 10Alexandros Kosiaris) [15:18:30] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for ttaylor - https://phabricator.wikimedia.org/T410752 (10DPogorzelski-WMF) 03NEW [15:19:16] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for ttaylor - https://phabricator.wikimedia.org/T410752#11396328 (10calbon) Approved [15:19:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:24:34] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for ttaylor - https://phabricator.wikimedia.org/T410752#11396332 (10DPogorzelski-WMF) [15:27:33] (03PS1) 10Dpogorzelski: ml-lab-users: add ttaylor [puppet] - 10https://gerrit.wikimedia.org/r/1208377 (https://phabricator.wikimedia.org/T410752) [15:28:38] (03PS1) 10Bking: relforge: switch to UEFI partition scheme [puppet] - 10https://gerrit.wikimedia.org/r/1208378 (https://phabricator.wikimedia.org/T410681) [15:30:08] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1208377 (https://phabricator.wikimedia.org/T410752) (owner: 10Dpogorzelski) [15:30:25] (03PS11) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [15:30:31] (03CR) 10Dpogorzelski: [C:03+2] ml-lab-users: add ttaylor [puppet] - 10https://gerrit.wikimedia.org/r/1208377 (https://phabricator.wikimedia.org/T410752) (owner: 10Dpogorzelski) [15:30:38] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [15:32:11] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:33:22] (03CR) 10Bking: [C:03+2] relforge: switch to UEFI partition scheme [puppet] - 10https://gerrit.wikimedia.org/r/1208378 (https://phabricator.wikimedia.org/T410681) (owner: 10Bking) [15:33:32] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1208378 (https://phabricator.wikimedia.org/T410681) (owner: 10Bking) [15:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:17] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:40:33] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:40:36] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART 
[15:42:14] (03CR) 10Dillon: [C:03+1] Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [15:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11396438 (10Jclark-ctr) a:03Jclark-ctr supermicro case 00064512 [15:46:57] (03PS1) 10LorenMora: [Legal Footer] Create config for adding legal footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) [15:47:39] (03PS1) 10Ssingh: plugins/wmf-netbox: set ipv4_only for hcaptcha-proxy [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1208381 (https://phabricator.wikimedia.org/T409780) [15:48:04] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice-archive: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11396452 (10CDanis) >>! In T400119#11395040, @Guycn2 wrote: > This seems to be blocking legitimate web captures by the Internet Archive (see, f... [15:50:18] (03PS1) 10Btullis: Record the fact that ankita97531 has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1208383 (https://phabricator.wikimedia.org/T410389) [15:50:26] (03CR) 10Ssingh: "All internal services -- recdns, ntp-[abc], syslog, do IPv4 so I think it's safe for right now to do this as v4 as well. 
We can revisit it" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1208381 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [15:58:22] (03PS1) 10Ssingh: P:cumin: add aliases for hcaptcha-proxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208385 (https://phabricator.wikimedia.org/T409780) [15:59:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7679/co" [puppet] - 10https://gerrit.wikimedia.org/r/1208385 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [15:59:58] (03CR) 10Ssingh: P:cumin: add aliases for hcaptcha-proxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208385 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:01:38] (03CR) 10Ssingh: [C:03+2] P:bird::anycast_monitoring: add hcaptcha-proxy.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1204074 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:02:54] (03PS2) 10Cyndywikime: [Growth]:Remove GELevelingUpNewNotificationsEnabled config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) [16:04:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:05:14] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) (owner: 10Cyndywikime) [16:08:34] (03CR) 10Ssingh: "Deferring to Brett for the review." 
[dns] - 10https://gerrit.wikimedia.org/r/1207288 (owner: 10Dzahn) [16:08:37] (03CR) 10Btullis: [C:03+2] Record the fact that ankita97531 has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/1208383 (https://phabricator.wikimedia.org/T410389) (owner: 10Btullis) [16:10:07] (03PS3) 10Tiziano Fogli: metamonitoring/icinga: trigger pages only for the active instance [puppet] - 10https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625) [16:10:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:11:29] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:14:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:14:41] 06SRE, 10SRE-Access-Requests: Requesting access to ml-lab-users for ttaylor - https://phabricator.wikimedia.org/T410752#11396544 (10BTullis) 05Open→03Resolved a:03BTullis [16:16:45] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:19:14] (03PS1) 10Dzahn: httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 [16:21:37] (03PS1) 10Dzahn: httpbb: delete tests on legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208399 [16:22:20] (03CR) 10Bking: [C:03+2] opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [16:24:18] (03PS1) 10Dzahn: installserver: remove legacy miscweb VMs 
[puppet] - 10https://gerrit.wikimedia.org/r/1208400 [16:27:16] (03PS1) 10Dzahn: prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 [16:28:09] (03PS1) 10Dzahn: site: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208402 [16:29:56] (03PS2) 10Dzahn: prometheus: drop class config for role::miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1208401 (https://phabricator.wikimedia.org/T397080) [16:31:10] (03PS2) 10Dzahn: site: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208402 (https://phabricator.wikimedia.org/T397080) [16:31:40] (03PS2) 10Dzahn: installserver: remove legacy miscweb VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208400 (https://phabricator.wikimedia.org/T397080) [16:31:55] (03PS2) 10Dzahn: httpbb: move os-reports test file for services on miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1208398 (https://phabricator.wikimedia.org/T397080) [16:33:49] (03CR) 10Dzahn: [C:03+2] admin: transfer group approver for releasers-mediawiki to Mateus Santos [puppet] - 10https://gerrit.wikimedia.org/r/1207304 (owner: 10Dzahn) [16:34:17] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:34:30] (03CR) 10Dzahn: [C:03+2] "thanks. @tcipriani@wikimedia.org @msantos@wikimedia.org done!" [puppet] - 10https://gerrit.wikimedia.org/r/1207304 (owner: 10Dzahn) [16:35:19] (03CR) 10Dzahn: [C:03+2] "thanks, WMDE-Fisch. done" [puppet] - 10https://gerrit.wikimedia.org/r/1207967 (owner: 10Dzahn) [16:35:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1208385 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:36:42] (03CR) 10Dzahn: "relengers: wanna confirm this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1207313 (owner: 10Dzahn) [16:36:53] (03CR) 10Ssingh: [C:03+2] P:cumin: add aliases for hcaptcha-proxy VMs [puppet] - 10https://gerrit.wikimedia.org/r/1208385 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:39:17] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:39:27] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:43:25] (03PS3) 10Andrew Bogott: profile::idp: require many args to be non-empty [puppet] - 10https://gerrit.wikimedia.org/r/1208370 [16:43:25] (03PS12) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [16:44:21] (03CR) 10BPirkle: [C:03+1] "Deprecations confirmed locally" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773) (owner: 10Aaron Schulz) [16:44:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [16:47:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDo [16:48:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:48:54] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:52:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:53:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11396652 (10RobH) Sent an email to @btullis to ensure he is aware of these 12 hosts pending his feedback, subject line: Ne... 
[16:53:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:54:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:54:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:56:22] (03PS1) 10Dzahn: releases::mediawiki: change the time when jenkins is restarted [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) [16:57:19] (03PS2) 10Dzahn: releases::mediawiki: change the time when jenkins is restarted [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) [16:57:35] (03PS3) 10Dzahn: releases::mediawiki: change the time when jenkins is restarted [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) [16:59:34] (03CR) 10Cathal Mooney: [C:03+1] plugins/wmf-netbox: set ipv4_only for hcaptcha-proxy [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1208381 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [16:59:37] (03CR) 10Cathal Mooney: [C:03+2] plugins/wmf-netbox: set ipv4_only for hcaptcha-proxy [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1208381 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [17:01:30] (03PS1) 10DDesouza: Pre-deploy 2025 Global Readers Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208408 (https://phabricator.wikimedia.org/T410696) [17:01:47] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to 
cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Homer release v0.11.0 minor update - cmooney@cumin1003 [17:02:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11396665 (10RobH) Day 8 Update: * 3 hosts moved, 19 remain * John worked with Amir directly today to depool and migrate pc101[678] since the depool and repoo... [17:03:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Homer release v0.11.0 minor update - cmooney@cumin1003 [17:09:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:13:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11396695 (10Volans) @Milimetric / @Ahoelzl by any chance one of you could review this task for approval? 
[17:15:45] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:19:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-6gdw7:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:20:51] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:24:05] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:24:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-6gdw7:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:24:19] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11396737 (10Andrew) (I'm moving this to a private address; lots of cookbook things to come) [17:25:06] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:27:23] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:27:40] !log homer "cr*" commit "remove IPv6 for hcaptcha-proxy group": T409780 [17:27:42] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:51] (03PS4) 10Andrew Bogott: profile::idp: require many args to be non-empty [puppet] - 10https://gerrit.wikimedia.org/r/1208370 [17:27:51] (03PS13) 10Andrew Bogott: cloudidp-dev: Hiera changes to make more like normal idp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1208350 (https://phabricator.wikimedia.org/T410294) [17:27:51] (03PS1) 10Andrew Bogott: cloudsetpidp2001-dev: prepare to move to a private IP [puppet] - 10https://gerrit.wikimedia.org/r/1208411 (https://phabricator.wikimedia.org/T410294) [17:29:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11396746 (10RobH) IRC Update: @btullis and I have chatted via IRC and worked out scheduling for the 12 remaining hosts:... [17:30:03] (03CR) 10Andrew Bogott: [C:03+2] cloudsetpidp2001-dev: prepare to move to a private IP [puppet] - 10https://gerrit.wikimedia.org/r/1208411 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [17:30:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11396750 (10BTullis) Apologies for the delays in responding. After discussion with @RobH we have decided on the following:... 
[17:32:28] !log andrew@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudidp2001-dev.codfw.wmnet [17:32:30] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [17:32:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host relforge1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:33:12] !log homer "as*" commit "remove IPv6 for hcaptcha-proxy group": T409780 [17:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:36:31] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [17:36:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [17:36:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:37] !log andrew@cumin2002 START - Cookbook sre.dns.wipe-cache cloudidp2001-dev.codfw.wmnet on all recursors [17:36:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudidp2001-dev.codfw.wmnet on all recursors [17:37:12] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [17:37:17] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [17:37:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.codfw.wmnet with OS trixie [17:38:08] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11396759 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cum... [17:42:17] !log bking@cumin1003 START - Cookbook sre.hosts.reimage for host relforge1008.eqiad.wmnet with OS bookworm [17:44:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:44:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11396783 (10Volans) @DSmit-WMF did you have a chance to go through the documentation to clarify which access you need to the `analytics-privatedata-users` group? Perhaps just l... 
[17:51:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1009.eqiad.wmnet with OS bookworm [17:51:54] (03PS7) 10Btullis: Add a new deploy-spark-support clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) [17:51:54] (03PS7) 10Btullis: Enable the deploy-spark-support deploy clusterrole for two test namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208317 (https://phabricator.wikimedia.org/T410017) [17:51:54] (03PS7) 10Btullis: Add an analytics namespace to both dse-k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208318 (https://phabricator.wikimedia.org/T410017) [17:52:21] !log bking@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [17:52:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1010.eqiad.wmnet with OS bookworm [17:52:59] (03PS1) 10CDobbins: sre.loadbalancer: WIP patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) [17:53:03] (03PS1) 10Scott French: admin: Add FIDO-backed ssh key for swfrench [puppet] - 10https://gerrit.wikimedia.org/r/1208416 [17:53:33] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudidp2001-dev.codfw.wmnet with OS trixie [17:53:33] !log andrew@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host cloudidp2001-dev.codfw.wmnet [17:53:51] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11396797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin20... 
[17:54:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11396798 (10RobH) Can we also set a date/time for moving the other three hosts remaining? wikikube-ctrl1003 kafka-main1008 kafka-main1009 [17:54:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.codfw.wmnet with OS trixie [17:57:26] (03CR) 10Btullis: Add a new deploy-spark-support clusterrole (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208316 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [17:57:46] !log bking@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1008.eqiad.wmnet with reason: host reimage [17:59:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:59:43] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: WIP patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:01:52] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [18:02:46] (03PS2) 10CDobbins: sre.loadbalancer: WIP patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) [18:04:10] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [18:04:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:08:23] !log 
andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudidp2001-dev.codfw.wmnet with OS trixie [18:08:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1009.eqiad.wmnet with reason: host reimage [18:12:33] !log bking@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1008.eqiad.wmnet with OS bookworm [18:12:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1010.eqiad.wmnet with reason: host reimage [18:12:54] (03PS3) 10CDobbins: sre.loadbalancer: WIP patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) [18:14:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:15:27] (03CR) 10Scott French: "I'll follow up on this early next week." [puppet] - 10https://gerrit.wikimedia.org/r/1208416 (owner: 10Scott French) [18:23:32] (03CR) 10Ssingh: "Yeah this can work, let's confirm Monday!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1208415 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:23:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1009.eqiad.wmnet with OS bookworm [18:24:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.codfw.wmnet with OS trixie [18:27:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1010.eqiad.wmnet with OS bookworm [18:29:02] (03CR) 10AOkoth: [C:03+1] gerrit: add a local backup cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:29:54] (03CR) 10AOkoth: vrts: alert on vrts junk queue size (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [18:30:03] (03PS3) 10AOkoth: vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) [18:34:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:34:59] (03CR) 10AOkoth: [C:03+2] vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [18:36:10] (03Merged) 10jenkins-bot: vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [18:38:37] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudidp2001-dev.codfw.wmnet with OS trixie [18:39:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - 
https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:54:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [18:54:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:03:00] (03PS1) 10Ayounsi: ayounsi: Add new yubikey key [puppet] - 10https://gerrit.wikimedia.org/r/1208426 [19:09:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:12:41] (03CR) 10C. 
Scott Ananian: [C:03+1] Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra) [19:32:49] (03PS1) 10Ebernhardson: cirrus relforge: Setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) [19:33:30] (03CR) 10CI reject: [V:04-1] cirrus relforge: Setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [19:39:43] (03CR) 10Muehlenhoff: cirrus relforge: Setup firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [19:44:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:49:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:04:19] (03PS2) 10Ebernhardson: cirrus relforge: Setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) [20:04:19] (03CR) 10Ebernhardson: cirrus relforge: Setup firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [20:05:07] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7680/co" [puppet] - 10https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) (owner: 10Ebernhardson) [20:12:20] 
(CR) MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: MusikAnimal) [20:24:06] (PS1) RLazarus: alertmanager: Add a receiver for experiment-platform-ircmailtask [puppet] - https://gerrit.wikimedia.org/r/1208437 (https://phabricator.wikimedia.org/T398869) [20:24:08] (PS1) RLazarus: pyrra: Enable SLO alerts for experimentation-platform [puppet] - https://gerrit.wikimedia.org/r/1208438 (https://phabricator.wikimedia.org/T398869) [20:26:37] (PS1) Ladsgroup: Fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) [20:30:45] (CR) Dr0ptp4kt: [C:+1] pyrra: Enable SLO alerts for experimentation-platform [puppet] - https://gerrit.wikimedia.org/r/1208438 (https://phabricator.wikimedia.org/T398869) (owner: RLazarus) [20:30:49] (CR) Bking: [C:+2] cirrus relforge: Setup firewall rules [puppet] - https://gerrit.wikimedia.org/r/1208432 (https://phabricator.wikimedia.org/T410681) (owner: Ebernhardson) [20:34:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:35:48] !log andrew@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudidp2001-dev.codfw.wmnet [20:35:50] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [20:36:48] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [20:37:06] (PS1) Dzahn: zuul: add $service_ensure parameter for zuul services (WIP) [puppet] - https://gerrit.wikimedia.org/r/1208441 
(https://phabricator.wikimedia.org/T410756) [20:37:08] (CR) Dr0ptp4kt: [C:+1] alertmanager: Add a receiver for experiment-platform-ircmailtask [puppet] - https://gerrit.wikimedia.org/r/1208437 (https://phabricator.wikimedia.org/T398869) (owner: RLazarus) [20:39:17] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:39:27] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:39:50] (PS2) RLazarus: pyrra: Enable SLO alerts for experiment-platform [puppet] - https://gerrit.wikimedia.org/r/1208438 (https://phabricator.wikimedia.org/T398869) [20:40:35] !log zuul2002 - rm /lib/systemd/system/zuul* ; systemctl daemon-reload ; systemctl reset-failed - fixes T410756 [20:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:39] T410756: SystemdUnitFailed - zuul-executor.service on zuul2002 - https://phabricator.wikimedia.org/T410756 [20:41:31] andrew@cumin2002 makevm (PID 2033764) is awaiting input [20:42:36] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [20:42:50] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1208437 (https://phabricator.wikimedia.org/T398869) (owner: RLazarus) [20:43:27] (PS1) Dzahn: Revert "site: move zuul2002 to insetup role temporarily" [puppet] - https://gerrit.wikimedia.org/r/1208443 [20:44:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [20:44:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:44:20] !log andrew@cumin2002 START - Cookbook sre.dns.wipe-cache cloudidp2001-dev.codfw.wmnet on all recursors [20:44:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudidp2001-dev.codfw.wmnet on all recursors [20:44:55] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [20:45:00] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.codfw.wmnet - andrew@cumin2002" [20:45:07] (PS2) RLazarus: alertmanager: Add a receiver for experiment-platform-ircmailtask [puppet] - https://gerrit.wikimedia.org/r/1208437 (https://phabricator.wikimedia.org/T398869) [20:45:07] (PS3) RLazarus: pyrra: Enable SLO alerts for experiment-platform [puppet] - https://gerrit.wikimedia.org/r/1208438 (https://phabricator.wikimedia.org/T398869) [20:45:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.codfw.wmnet with OS trixie [20:45:24] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [20:45:27] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade 
firmware for hosts sretest1002.eqiad.wmnet [20:45:44] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11397499 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cum... [20:46:07] (CR) RLazarus: [C:+2] "Thanks both!" [puppet] - https://gerrit.wikimedia.org/r/1208437 (https://phabricator.wikimedia.org/T398869) (owner: RLazarus) [20:46:40] (CR) RLazarus: [C:+2] pyrra: Enable SLO alerts for experiment-platform [puppet] - https://gerrit.wikimedia.org/r/1208438 (https://phabricator.wikimedia.org/T398869) (owner: RLazarus) [20:49:17] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-bfhcs:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:49:30] (PS5) Pppery: Remove bad Norwegian funnels [puppet] - https://gerrit.wikimedia.org/r/1208442 (https://phabricator.wikimedia.org/T407553) [20:49:48] (CR) Zabe: [C:+1] Fix db config for offline maint scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1208439 (https://phabricator.wikimedia.org/T410738) (owner: Ladsgroup) [20:54:35] SRE-SLO, OKR-Work, Patch-For-Review: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11397523 (RLazarus) Alerts are enabled! Let's continue to monitor here a tiny bit longer, just in case they behave unexpectedly and the initial config needs tweaking -- but after a few days of... 
[20:57:27] (PS1) Dzahn: releases: stop automatic jenkins restarts, for now [puppet] - https://gerrit.wikimedia.org/r/1208447 (https://phabricator.wikimedia.org/T410729) [20:58:15] (PS2) Dzahn: releases: stop automatic jenkins restarts, for now [puppet] - https://gerrit.wikimedia.org/r/1208447 (https://phabricator.wikimedia.org/T410729) [20:58:16] PROBLEM - snapshot of s6 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s6 at eqiad (db1225) taken on 2025-11-21 20:36:28 is 367 GiB, but the previous one was 463 GiB, a change of -20.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [20:58:55] (CR) Dzahn: "for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1208447" [puppet] - https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) (owner: Dzahn) [20:58:57] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-bfhcs:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [20:59:36] FIRING: [5x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:01:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:03:32] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudidp2001-dev.codfw.wmnet with OS trixie [21:03:32] !log andrew@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host cloudidp2001-dev.codfw.wmnet [21:03:47] SRE, cloud-services-team, Cloud-VPS, 
Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11397534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin20... [21:07:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:09:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:36] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:11:49] PROBLEM - Host cloudidp2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [21:12:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:13:58] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:17:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:17:56] (PS1) Dzahn: releases: rename releases::mediawiki to jenkins, remove unused Hiera [puppet] - https://gerrit.wikimedia.org/r/1208450 [21:18:57] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:19:17] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:47] (PS1) RLazarus: admin: Move rzl pre-FIDO ssh key to buster only [puppet] - https://gerrit.wikimedia.org/r/1208451 [21:22:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:22:17] FIRING: [3x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:04] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1208447/7685/" [puppet] - https://gerrit.wikimedia.org/r/1208447 (https://phabricator.wikimedia.org/T410729) (owner: Dzahn) [21:24:28] (CR) RLazarus: [C:+1] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [21:25:17] (CR) RLazarus: [C:+1] deployment_server: switch mw-script/main to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1207980 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [21:25:34] (CR) RLazarus: [C:+1] deployment_server: follow main release in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1207981 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [21:27:17] RESOLVED: [5x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:30:25] (CR) Scott French: [C:+1] admin: Move rzl pre-FIDO ssh key to buster only [puppet] - https://gerrit.wikimedia.org/r/1208451 (owner: RLazarus) [21:31:17] FIRING: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:13] (CR) RLazarus: [C:+1] admin: Add FIDO-backed ssh key for swfrench [puppet] - https://gerrit.wikimedia.org/r/1208416 (owner: Scott French) [21:35:49] (CR) Scott French: "Thanks, Reuven!" 
[puppet] - https://gerrit.wikimedia.org/r/1208416 (owner: Scott French) [21:36:17] FIRING: [11x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:18] (CR) Scott French: [C:+2] admin: Add FIDO-backed ssh key for swfrench [puppet] - https://gerrit.wikimedia.org/r/1208416 (owner: Scott French) [21:37:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:41:17] FIRING: [9x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:48:58] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:42] FIRING: AlertLintProblem: Linting problems found for DiskSpace - 
https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [21:56:17] FIRING: [8x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:47] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [22:03:54] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [22:04:15] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [22:16:49] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [22:17:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:23:57] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:24:48] !log bking@wdqs1011 `systemctl restart wdqs-blazegraph.service` (responding to ProbeDown) [22:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.894s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:28:57] (PS10) Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) [22:36:26] (CR) Scott French: P:cache::varnish::frontend: render known-client rate limit VCL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [22:38:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:38:57] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:39:02] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:34] (PS1) BryanDavis: labswiki: Enable sitenotice on mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1208478 (https://phabricator.wikimedia.org/T410702) [22:49:55] !log bking@wdqs2007 roll-restart wdqs CODFW for high lag 
https://w.wiki/GDad [22:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:52:48] updating envoy in wikikube staging only, no production impact [22:53:32] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [22:53:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [22:54:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:57:49] just kidding! 
doing that for real in a little while [22:58:43] <3 [23:02:55] (PS1) RLazarus: api-gateway: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1208483 [23:03:58] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-gq2sk:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:14:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T410589)', diff saved to https://phabricator.wikimedia.org/P85449 and previous config saved to /var/cache/conftool/dbconfig/20251121-231439-ladsgroup.json [23:14:45] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [23:16:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:17:52] (PS1) RLazarus: all charts: Update mesh.configuration 1.14.1 to 1.15.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) [23:18:57] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:21:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:23:58] FIRING: [4x] CalicoHighMemoryUsage: Calico container 
calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:27:32] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:28:58] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:29:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85450 and previous config saved to /var/cache/conftool/dbconfig/20251121-232946-ladsgroup.json [23:32:32] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:40:00] (CR) Thcipriani: [C:+1] admin: deprecate the releasers-blubber group [puppet] - https://gerrit.wikimedia.org/r/1207313 (owner: Dzahn) [23:43:57] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:44:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85451 and previous config saved to /var/cache/conftool/dbconfig/20251121-234454-ladsgroup.json [23:45:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:48:57] FIRING: [4x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:50:04] (CR) Scott French: [C:+1] api-gateway: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1208483 (owner: RLazarus) [23:50:34] (CR) RLazarus: "(As discussed offline, lining this up for Monday but won't merge it until then, to avoid leaving every chart with undeployed diffs over th" [deployment-charts] - https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) (owner: RLazarus) [23:51:01] (CR) RLazarus: [C:+2] api-gateway: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1208483 (owner: RLazarus) [23:52:56] (Merged) jenkins-bot: api-gateway: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1208483 (owner: RLazarus) [23:53:57] FIRING: [3x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage