[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0000) [00:02:22] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1216680 [00:02:26] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1216681 [00:02:31] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1216682 [00:04:26] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1216681 (owner: 10Ncmonitor) [00:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:31:11] RECOVERY - dump of s5 in codfw on backupmon1001 is OK: Last dump for s5 at codfw (db2201) taken on 2025-12-09 00:00:02 (51 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:32:46] FIRING: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:37:46] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:39:15] RECOVERY - dump of s6 in codfw on backupmon1001 is OK: Last dump for s6 at codfw (db2197) taken on 2025-12-09 00:00:05 (61 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:39:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1216686 [00:39:57] (03CR) 10TrainBranchBot: [C:03+2] Branch 
commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1216686 (owner: 10TrainBranchBot) [00:44:11] RECOVERY - dump of s6 in eqiad on backupmon1001 is OK: Last dump for s6 at eqiad (db1225) taken on 2025-12-09 00:00:05 (61 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:51:24] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1216686 (owner: 10TrainBranchBot) [00:52:46] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:57:46] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:13] RECOVERY - dump of s5 in eqiad on backupmon1001 is OK: Last dump for s5 at eqiad (db1216) taken on 2025-12-09 00:00:05 (51 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:04:27] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11442705 (10Jclark-ctr) If I can do tomorrow between 3pm -6pm? 
[01:04:45] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11442708 (10Jclark-ctr) @rkemper^ [01:10:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1216692 [01:10:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1216692 (owner: 10TrainBranchBot) [01:18:20] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 17m 40s) [01:34:08] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1216692 (owner: 10TrainBranchBot) [01:36:46] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:56:46] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0300) [03:15:22] (03PS1) 10RLazarus: mathoid: Upgrade to envoy-future:1.35.7 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216701 
(https://phabricator.wikimedia.org/T410975) [03:15:24] (03PS1) 10RLazarus: {api,rest}-gateway: Update staging to Envoy 1.35.7 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216702 (https://phabricator.wikimedia.org/T410975) [03:30:13] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:47:06] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0400) [04:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:18:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[04:18:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [04:21:45] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:22:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [04:23:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[04:23:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [04:26:45] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:27:45] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [04:36:46] FIRING: Primary outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:37:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [04:41:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:42:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [04:51:46] FIRING: Primary outbound port utilisation 
over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:52:45] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [04:56:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [04:57:45] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0500) [05:02:46] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.3 (duration: 02m 44s) [05:02:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T410589)', diff saved to https://phabricator.wikimedia.org/P86462 and previous config saved to /var/cache/conftool/dbconfig/20251209-050258-ladsgroup.json [05:03:03] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:46] FIRING: Primary 
outbound port utilisation over 90% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [05:12:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [05:12:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [05:13:46] FIRING: Primary inbound port utilisation over 90% #page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [05:17:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[05:17:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [05:18:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86463 and previous config saved to /var/cache/conftool/dbconfig/20251209-051806-ladsgroup.json [05:19:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [05:19:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [05:33:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86464 and previous config saved to /var/cache/conftool/dbconfig/20251209-053314-ladsgroup.json [05:34:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[05:34:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [05:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [05:38:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [05:48:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T410589)', diff saved to https://phabricator.wikimedia.org/P86465 and previous config saved to /var/cache/conftool/dbconfig/20251209-054822-ladsgroup.json [05:48:26] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:48:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [05:53:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[05:53:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:00:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [06:00:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:10:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [06:10:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:12:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[06:12:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:22:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [06:22:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:26:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[06:26:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [06:39:50] (03PS1) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 [06:45:38] (03Abandoned) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (owner: 10Nvdtn19) [06:48:07] (03Restored) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (owner: 10Nvdtn19) [06:52:00] (03PS1) 10Nvdtn19: Configuration for viwikivoyage per T405724 (fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216722 [06:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0700) [07:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0700). 
[07:01:46] RESOLVED: Primary outbound port utilisation over 90% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+90%25++%23page [07:01:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [07:01:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:02:59] (03PS2) 10Nvdtn19: Configuration for viwikivoyage per T405724 (fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216722 [07:03:41] (03Abandoned) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (owner: 10Nvdtn19) [07:03:46] RESOLVED: Primary inbound port utilisation over 90% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 90% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+90%25++%23page [07:12:33] (03Restored) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (owner: 10Nvdtn19) [07:14:09] (03PS2) 10Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 [07:14:40] (03Abandoned) 10Nvdtn19: Configuration for viwikivoyage per T405724 (fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216722 (owner: 10Nvdtn19) [07:30:13] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:40:15] 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078 (10LSobanski) 03NEW [07:41:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11442945 (10LSobanski) [07:47:06] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:52:43] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1216573 (owner: 10L10n-bot) [07:56:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11442948 (10Leif_WMDE) Hello @Dzahn thanks for the update and connecting to Katie. The NDA is signed. Cheers, Leif [08:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T0800). Please do the needful. [08:00:05] sfaci and WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:30] \o [08:00:38] \o [08:01:43] I could deploy I guess [08:03:12] ....if I could log in ... [08:04:19] Isn't the Spider Pig OTP the "old" Wikitech OTP? 🤔 [08:06:01] old? isn't it the new one? 
[08:07:44] ah lol wait a sec I should have read the instructions again [08:09:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215214 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [08:09:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by wmde-fisch@deploy2002 using scap backport" [extensions/Cite] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216553 (https://phabricator.wikimedia.org/T411245) (owner: 10WMDE-Fisch) [08:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:16:08] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216749 [08:17:44] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215214 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [08:19:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216749 (owner: 10Matthias Mullie) [08:19:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:20:32] (03Merged) 10jenkins-bot: VE: Don't create a synth ref when there's a LDR 
main ref [extensions/Cite] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216553 (https://phabricator.wikimedia.org/T411245) (owner: 10WMDE-Fisch) [08:21:22] !log wmde-fisch@deploy2002 Started scap sync-world: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]] [08:21:28] T407570: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570 [08:21:28] T411245: Dirty diff when adding details to references that have list defined content - https://phabricator.wikimedia.org/T411245 [08:23:23] !log wmde-fisch@deploy2002 wmde-fisch, sfaci: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:23:37] sfaci: Anything you want to test? [08:24:01] No, it's just the implementation of an experiment that needs to be activated first [08:24:08] nothing to test for now [08:24:09] thanks! [08:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:25:03] !log wmde-fisch@deploy2002 wmde-fisch, sfaci: Continuing with sync [08:25:14] sfaci: cool syncing now [08:25:23] thanks! [08:26:30] WMDE-Fisch: I just added a late addition to the backport window, but can handle deployment myself - can you ping me once you're done? Thanks! 
[08:26:35] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [08:26:38] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [08:27:10] matthiasmullie: Yes I'll do. Just wanted to ask you that :-) [08:29:40] (03PS1) 10Alexandros Kosiaris: Update fc-list to point to fc-list Tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) [08:30:19] !log wmde-fisch@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]] (duration: 08m 56s) [08:30:23] T407570: Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570 [08:30:24] T411245: Dirty diff when adding details to references that have list defined content - https://phabricator.wikimedia.org/T411245 [08:30:52] sfaci: matthiasmullie all done! [08:31:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216749 (owner: 10Matthias Mullie) [08:31:28] Cool! Thank you very much! [08:31:44] WMDE-Fisch: Thanks. 
Starting [08:31:45] You're welcome :-) [08:31:54] * WMDE-Fisch out [08:32:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11442983 (10jcrespo) [08:34:17] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216749 (owner: 10Matthias Mullie) [08:34:35] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1216749|Squashed diff to master]] [08:36:50] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1216749|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:38:06] !log mlitn@deploy2002 mlitn: Continuing with sync [08:38:12] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11442988 (10jcrespo) [08:39:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:41:19] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11442990 (10jcrespo) While we process your request, to speed up the current and future requests, could I ask you, @Solenne_... 
[08:42:09] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216749|Squashed diff to master]] (duration: 07m 34s) [08:44:40] RESOLVED: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:44:59] Nothing left to deploy this window? [08:46:35] !log UTC morning backports done [08:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:55:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11442995 (10jcrespo) Hi, @Lena_WMDE we don't have you on the list of approval managers for WMDE: https://wikitech.wikimedia... 
[08:55:34] (03PS1) 10Ayounsi: Turnilo: annotate well known JA3N [puppet] - 10https://gerrit.wikimedia.org/r/1216753 [08:59:17] (03CR) 10Scott French: [C:03+1] mathoid: Upgrade to envoy-future:1.35.7 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216701 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [08:59:19] (03CR) 10Scott French: [C:03+1] {api,rest}-gateway: Update staging to Envoy 1.35.7 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216702 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [09:00:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:04:12] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443001 (10jcrespo) Hi, @KFrancis requesting an NDA filing for the email show on the header above for the given WMDE emplo... 
[09:05:40] RESOLVED: KubernetesRsyslogDown: rsyslog on dse-k8s-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:05:48] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443002 (10jcrespo) [09:12:15] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443011 (10jcrespo) [09:12:52] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443012 (10jcrespo) [09:14:41] (03CR) 10Elukey: [C:03+1] Nokia: ensure disabled ports speed is set correctly [homer/public] - 10https://gerrit.wikimedia.org/r/1216595 (https://phabricator.wikimedia.org/T409178) (owner: 10Ayounsi) [09:16:20] (03PS1) 10Slyngshede: P:idp allow id-token claims to be enabled [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) [09:17:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7799/console" [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:18:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7800/console" [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:19:47] (03CR) 10Brouberol: [C:03+2] Override the 
from: email address coming from Airflow dev instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214131 (https://phabricator.wikimedia.org/T411536) (owner: 10Aleksandar Mastilovic) [09:20:32] (03PS2) 10Slyngshede: P:idp allow id-token claims to be enabled [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) [09:21:17] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7801/co" [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:24:02] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7802/console" [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:26:21] (03CR) 10Brouberol: [C:03+1] P:idp allow id-token claims to be enabled [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:26:34] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp allow id-token claims to be enabled [puppet] - 10https://gerrit.wikimedia.org/r/1216755 (https://phabricator.wikimedia.org/T411752) (owner: 10Slyngshede) [09:29:30] 06SRE, 10SRE-Access-Requests: Add FIDO backed production SSH key for Papaul - https://phabricator.wikimedia.org/T411833#11443060 (10jcrespo) 05Open→03Resolved This looks resolved to me, please @Papaul reopen if something else is needed. 
[09:32:16] 06SRE, 10SRE-Access-Requests: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11443067 (10jcrespo) 05In progress→03Resolved [09:33:31] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443071 (10jcrespo) p:05Triage→03High [09:37:46] (03CR) 10Harroyo-wmf: [C:03+1] Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [09:37:51] (03CR) 10Harroyo-wmf: [C:03+1] Enable IRS v2 non-emergency workflow on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216561 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [09:40:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:41:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11443084 (10jcrespo) We don't yet have the confirmation from Legal on file, waiting for that. @Leif_WMDE Do you need actual SSH access, or just web, like T411977 ? 
[09:43:51] (03PS1) 10Ayounsi: Add eqiad/codfw loopback ranges to network::infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/1216757 [09:45:30] (03Abandoned) 10Ayounsi: Add eqiad/codfw loopback ranges to network::infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/1216757 (owner: 10Ayounsi) [09:46:48] 06SRE: dpogorzelski gpg key - https://phabricator.wikimedia.org/T411993#11443091 (10DPogorzelski-WMF) 05Open→03Resolved a:03DPogorzelski-WMF [09:47:31] !log elukey@puppetserver1001 conftool action : set/pooled=true:weight=10; selector: name=ml-serve1013.eqiad.wmnet [09:49:05] (03CR) 10Tiziano Fogli: [C:03+1] Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:50:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11443097 (10jcrespo) [09:50:59] (03PS1) 10Elukey: Add missing k8s config for ml-serve1013 to ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1216761 (https://phabricator.wikimedia.org/T403697) [09:51:41] (03CR) 10Elukey: [C:03+2] Add missing k8s config for ml-serve1013 to ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1216761 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [09:51:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11443125 (10jcrespo) p:05Triage→03High [09:53:07] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=ml-serve1013.eqiad.wmnet [09:56:01] (03CR) 10Dpogorzelski: [C:03+2] ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [09:57:06] (03PS1) 10Cathal Mooney: Add missing infra ranges eqiad and codfw to network data.yaml 
[puppet] - 10https://gerrit.wikimedia.org/r/1216762 [09:59:46] (03Abandoned) 10Klausman: installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) (owner: 10Klausman) [09:59:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216561 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [10:00:18] (03PS2) 10Cathal Mooney: Add missing infra ranges eqiad and codfw to network data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1216762 [10:00:38] dpogorzelski@cumin1003 rename (PID 2765050) is awaiting input [10:01:25] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11443145 (10Jelto) >>! In T408592#11441688, @ATitkov wrote: > Hi @Dzahn > > Sounds ok from my side. But I have question - would I still be... 
[10:01:51] RESOLVED: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:02:13] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1216762 (owner: 10Cathal Mooney) [10:03:00] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.rename from ml-lab1001 to ml-build1001 [10:03:29] !log dpogorzelski@cumin1003 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from ml-lab1001 to ml-build1001 [10:05:18] (03CR) 10Ayounsi: [C:03+1] Add missing infra ranges eqiad and codfw to network data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1216762 (owner: 10Cathal Mooney) [10:06:10] (03CR) 10Cathal Mooney: [C:03+2] Add missing infra ranges eqiad and codfw to network data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1216762 (owner: 10Cathal Mooney) [10:13:33] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443187 (10WMDE-leszek) I approve this request on WMDE's end. Thank you [10:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:49] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443220 (10jcrespo) Thank you! 
[10:23:00] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11443228 (10jcrespo) [10:23:45] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11443235 (10taavi) >>! In T408592#11443145, @Jelto wrote: >>>! In T408592#11441688, @ATitkov wrote: >> Hi @Dzahn >> >> Sounds ok from my si... [10:25:37] (03PS1) 10Dpogorzelski: ml-lab: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216764 [10:26:52] (03CR) 10Cathal Mooney: [C:03+1] Nokia: ensure disabled ports speed is set correctly [homer/public] - 10https://gerrit.wikimedia.org/r/1216595 (https://phabricator.wikimedia.org/T409178) (owner: 10Ayounsi) [10:27:45] (03CR) 10Ayounsi: [C:03+2] Nokia: ensure disabled ports speed is set correctly [homer/public] - 10https://gerrit.wikimedia.org/r/1216595 (https://phabricator.wikimedia.org/T409178) (owner: 10Ayounsi) [10:28:04] (03CR) 10Dpogorzelski: [C:03+2] ml-lab: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216764 (owner: 10Dpogorzelski) [10:29:04] (03Merged) 10jenkins-bot: Nokia: ensure disabled ports speed is set correctly [homer/public] - 10https://gerrit.wikimedia.org/r/1216595 (https://phabricator.wikimedia.org/T409178) (owner: 10Ayounsi) [10:30:59] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS trixie [10:38:18] !log set port-speed on disabled Nokia interface [10:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:12] (03Abandoned) 10Brouberol: Redirect mpic.w.o to test-kitchen.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213505 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:46:23] (03Abandoned) 10Brouberol: test-kitchen: allow both mpic/test-kitchen domains in the OIDC config [puppet] - 
10https://gerrit.wikimedia.org/r/1213428 (owner: 10Brouberol) [10:48:27] (03PS4) 10Brouberol: test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 [10:48:27] (03PS8) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [10:48:28] (03PS9) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [10:48:28] (03PS9) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [10:48:29] (03PS9) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [10:48:31] (03PS9) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [10:48:35] (03PS9) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [10:48:47] (03CR) 10CI reject: [V:04-1] test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 (owner: 10Brouberol) [10:48:58] (03CR) 10CI reject: [V:04-1] test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:49:19] (03CR) 10CI reject: [V:04-1] test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:49:40] (03PS5) 10Brouberol: test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 [10:49:40] (03PS9) 10Brouberol: test-kitchen-next: drop 
mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [10:49:40] (03PS10) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [10:49:40] (03PS10) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [10:49:41] (03PS10) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [10:49:43] (03PS10) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [10:49:47] (03PS10) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [10:49:52] (03CR) 10CI reject: [V:04-1] test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:55] (03CR) 10Santiago Faci: [C:03+1] test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 (owner: 10Brouberol) [10:57:11] !log dpogorzelski@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1001.eqiad.wmnet with OS trixie [10:57:24] (03CR) 10Brouberol: [C:03+2] test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 (owner: 10Brouberol) [10:58:46] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host 
ml-lab1001.eqiad.wmnet with OS trixie [11:00:07] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1100) [11:07:52] 06SRE, 10Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11443375 (10jcrespo) Hi, @Gnangarra There is already a list called Wikidebate: https://lists.wikimedia.org/postorius/lists/wikidebate.lists.wikimedia.org/ Mostly empty. Is the request a complet... [11:30:13] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:38:32] 06SRE: dpogorzelski gpg key - https://phabricator.wikimedia.org/T411993#11443608 (10SLyngshede-WMF) For the log: I've added @DPogorzelski-WMF and re-encrypted. [11:41:34] jouncebot: nowandnext [11:41:34] For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1100) [11:41:34] In 1 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1300) [11:46:42] dpogorzelski@cumin1003 reimage (PID 2773459) is awaiting input [12:00:18] jouncebot: nowandnext [12:00:18] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [12:00:18] In 0 hour(s) and 59 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1300) [12:02:17] I am going to deploy now [12:03:31] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11443686 (10Jelto) >>! 
In T408592#11443235, @taavi wrote: > That being said, transfers in or out of the `/toolforge-repos` namespaces must be... [12:07:39] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11443698 (10VRiley-WMF) a:03VRiley-WMF [12:09:24] !log dpogorzelski@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS trixie [12:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:15:24] I'm done [12:30:01] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS trixie [12:39:01] 06SRE, 10Wikimedia-Mailing-lists: Reports of unsubscribe from wikitech-ambassadors failing to work - https://phabricator.wikimedia.org/T405153#11443759 (10jcrespo) 05Stalled→03Resolved a:03jcrespo With the above feedback and no issue reported since, I would consider this either resolved or invalid, b... 
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1300) [13:03:28] !log dpogorzelski@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1001.eqiad.wmnet with OS trixie [13:04:19] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS trixie [13:06:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [13:06:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T410589)', diff saved to https://phabricator.wikimedia.org/P86471 and previous config saved to /var/cache/conftool/dbconfig/20251209-130640-ladsgroup.json [13:06:44] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:09:47] (03PS21) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [13:12:00] (03PS22) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [13:17:27] (03Abandoned) 10Gehel: topic: ops-limited access [puppet] - 10https://gerrit.wikimedia.org/r/1215152 (https://phabricator.wikimedia.org/T411774) (owner: 10JavierMonton) [13:19:36] !log dpogorzelski@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1001.eqiad.wmnet with OS trixie [13:19:39] (03PS1) 10Bartosz Wójtowicz: ml-services: Update experimental revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) [13:20:58] (03PS2) 10Bartosz Wójtowicz: ml-services: Update experimental revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) [13:23:43] (03PS1) 10Dpogorzelski: ml-build: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216789 [13:24:14] (03CR) 10CI reject: [V:04-1] ml-build: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216789 (owner: 10Dpogorzelski) [13:26:30] (03PS2) 10Dpogorzelski: ml-build: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216789 [13:28:03] (03PS3) 10Dpogorzelski: ml-build: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216789 [13:32:28] (03CR) 10Dpogorzelski: [C:03+2] ml-build: fix preseed [puppet] - 10https://gerrit.wikimedia.org/r/1216789 (owner: 10Dpogorzelski) [13:32:57] (03PS1) 10Gehel: LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1216793 (https://phabricator.wikimedia.org/T406222) [13:39:42] (03CR) 10Brouberol: [C:03+1] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1216793 (https://phabricator.wikimedia.org/T406222) (owner: 10Gehel) [13:42:09] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Update experimental revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [13:42:25] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215663 (owner: 10PipelineBot) [13:42:49] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214642 (owner: 10PipelineBot) [13:43:24] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db1229 gradually with 4 steps - Pooling in after cloning [13:44:17] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215663 (owner: 10PipelineBot) [13:44:25] (03CR) 10Gehel: [C:03+2] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1216793 (https://phabricator.wikimedia.org/T406222) (owner: 10Gehel) [13:45:29] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:45:43] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update experimental revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [13:45:54] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:47:00] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:47:07] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS trixie [13:47:30] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:47:49] (03Merged) 10jenkins-bot: ml-services: Update experimental revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [13:47:57] (03CR) 10Fabfur: [C:03+1] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1216793 (https://phabricator.wikimedia.org/T406222) (owner: 10Gehel) [13:47:57] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:48:26] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:48:54] !log sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' - T406222 [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:59] T406222: Add druid coordinator service to LVS for the druid_public cluster. - https://phabricator.wikimedia.org/T406222 [13:49:22] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:50:14] ^expected, this is me deploying [13:50:20] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:52:16] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 81 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [13:53:42] !log sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' - T406222 [13:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:22] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:55:20] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:57:16] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 82 connections established with conf1007.eqiad.wmnet:4001 (min=82) 
https://wikitech.wikimedia.org/wiki/PyBal [13:59:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1400). nyaa~ [14:00:05] stephanebisson and Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] o/ [14:00:12] o/ [14:00:22] I can start [14:00:39] o/ [14:00:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216643 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [14:00:52] stephanebisson: I would’ve started with Tran [14:01:08] given that your change is going to take a long time to deploy with its i18n changes [14:01:15] might as well shuffle through the Beta change first w^ [14:01:16] * ^^ [14:02:08] !log dpogorzelski@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-lab1001.eqiad.wmnet with reason: host reimage [14:03:43] (03PS1) 10Gehel: LVS: set druid-coordinator to state production [puppet] - 10https://gerrit.wikimedia.org/r/1216797 (https://phabricator.wikimedia.org/T406222) [14:04:35] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:08:24] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-lab1001.eqiad.wmnet with reason: host reimage [14:09:16] (03Merged) 10jenkins-bot: Article search: surface nominated collections (JSON files) [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1216643 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [14:09:35] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1216643|Article search: surface nominated collections (JSON files) (T408842)]] [14:09:38] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [14:11:38] dpogorzelski@cumin1003 reimage (PID 2796517) is awaiting input [14:14:35] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1216797 (https://phabricator.wikimedia.org/T406222) (owner: 10Gehel) [14:14:55] (03CR) 10Ayounsi: "lgtm, to be deployed once the new loopback is active." 
[puppet] - 10https://gerrit.wikimedia.org/r/1216679 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [14:15:06] (03CR) 10Gehel: [C:03+2] LVS: set druid-coordinator to state production [puppet] - 10https://gerrit.wikimedia.org/r/1216797 (https://phabricator.wikimedia.org/T406222) (owner: 10Gehel) [14:15:44] (03CR) 10Ayounsi: [C:03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1216677 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [14:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:52] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1229 gradually with 4 steps - Pooling in after cloning [14:40:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:41:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:31] (03PS1) 10Cathal Mooney: Eqiad C/D: Remove ESI-LAG config for Nokia connections to Juniper VCs [homer/public] - 10https://gerrit.wikimedia.org/r/1216802 (https://phabricator.wikimedia.org/T411781) [14:48:32] (03PS2) 10Cathal Mooney: Eqiad C/D: Remove ESI-LAG config for Nokia connections to Juniper VCs [homer/public] - 10https://gerrit.wikimedia.org/r/1216802 (https://phabricator.wikimedia.org/T411781) [14:50:07] (Docker has been trying to push the new image version for over half 
an hour now ._.) [14:51:15] (03PS1) 10Elukey: profile::amd_gpu: install amd-smi only when needed [puppet] - 10https://gerrit.wikimedia.org/r/1216803 [14:51:50] (03PS1) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216804 (https://phabricator.wikimedia.org/T409438) [14:52:46] (03CR) 10Dpogorzelski: [C:03+1] profile::amd_gpu: install amd-smi only when needed [puppet] - 10https://gerrit.wikimedia.org/r/1216803 (owner: 10Elukey) [14:52:51] Lucas_WMDE: o/ trying in the sense of failing and retrying, or something different? [14:53:26] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: install amd-smi only when needed [puppet] - 10https://gerrit.wikimedia.org/r/1216803 (owner: 10Elukey) [14:53:28] (03CR) 10Klausman: [C:03+1] profile::amd_gpu: install amd-smi only when needed [puppet] - 10https://gerrit.wikimedia.org/r/1216803 (owner: 10Elukey) [14:55:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:57:09] elukey: trying in the sense of there had been no output in /var/lib/spiderpig/scap-image-build-and-push-log since then [14:57:14] though apparently something moved now [14:57:21] “Waiting 300 seconds for swift after full mediawiki image build (T390251)” [14:57:23] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [14:57:31] so, the usual, backports with i18n take a very long time [14:57:58] super :) [14:57:59] (until bvibber deliver us from evil) [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1500) [15:01:35] (03CR) 10Ayounsi: [C:03+1] Eqiad C/D: Remove ESI-LAG config for Nokia connections to Juniper VCs [homer/public] - 10https://gerrit.wikimedia.org/r/1216802 
(https://phabricator.wikimedia.org/T411781) (owner: 10Cathal Mooney) [15:05:27] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1216643|Article search: surface nominated collections (JSON files) (T408842)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:05:31] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [15:06:10] !log sbisson@deploy2002 sbisson: Continuing with sync [15:06:36] (03PS1) 10Slyngshede: Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1216806 [15:07:46] (03CR) 10Ssingh: [C:03+1] "Thanks for taking care of this." [dns] - 10https://gerrit.wikimedia.org/r/1216806 (owner: 10Slyngshede) [15:10:02] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:50] (03CR) 10Fabfur: [C:03+1] Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1216806 (owner: 10Slyngshede) [15:12:38] (03PS1) 10Giuseppe Lavagetto: geomaps: send traffic intended for DRMRS to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/1216810 [15:14:26] (03CR) 10Clément Goubert: [C:03+1] geomaps: send traffic intended for DRMRS to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/1216810 (owner: 10Giuseppe Lavagetto) [15:15:15] !log restarting ATS on cp3074 [15:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:23] (03CR) 10JHathaway: [C:03+1] geomaps: send traffic intended for DRMRS to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/1216810 (owner: 10Giuseppe Lavagetto) [15:17:06] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1216787 (https://phabricator.wikimedia.org/T411758) (owner: 10Bartosz Wójtowicz) [15:17:08] (03CR) 10Elukey: [C:03+1] geomaps: send traffic intended for DRMRS to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/1216810 (owner: 10Giuseppe Lavagetto) [15:18:27] (03PS1) 10Bking: wdqs: Move temp hosts to wdqs::test role [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) [15:18:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:19:00] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216643|Article search: surface nominated collections (JSON files) (T408842)]] (duration: 69m 26s) [15:19:04] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [15:19:06] (out of curiosity, what’s the one host that’s apparently left in the sync-apaches phase? ^^) [15:19:14] Tran: if you’re still around we could deploy your beta change now [15:19:38] “Finished scap sync-world (duration: 69m 26s)” not nice :< [15:19:53] (03CR) 10Gehel: [C:04-1] wdqs: Move temp hosts to wdqs::test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:19:55] I'm around yeah [15:20:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216561 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [15:20:17] ok let’s go [15:20:57] alright, I can. Beta changes can go in w/spider pig, same as any other config change right? 
[15:21:02] (03PS2) 10Bking: wdqs: Move temp hosts to wdqs::test role [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) [15:21:02] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216812 (https://phabricator.wikimedia.org/T128546) [15:21:07] (03Merged) 10jenkins-bot: Enable IRS v2 non-emergency workflow on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216561 (https://phabricator.wikimedia.org/T410512) (owner: 10STran) [15:21:18] sorry, I already started the spiderpig [15:21:18] but yes [15:21:29] aaaand it’s done https://spiderpig.wikimedia.org/jobs/1056 [15:21:36] should take effect on beta in ca. 10 minutes [15:21:36] (03CR) 10CI reject: [V:04-1] wdqs: Move temp hosts to wdqs::test role [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:21:43] whoop, well thanks, I won't complain! 🙇 [15:22:12] ^^ [15:22:14] !log UTC afternoon backport+config window done [15:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:03] (03PS3) 10Bking: wdqs: Move temp hosts to wdqs::test role [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) [15:25:34] (03PS4) 10Bking: wdqs: Move temp hosts to wdqs::test role [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) [15:26:50] (03CR) 10Bking: wdqs: Move temp hosts to wdqs::test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:27:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1530) [15:30:13] FIRING: [4x] 
PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:35:02] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:29] (03CR) 10Gehel: wdqs: Move temp hosts to wdqs::test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1216811 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [15:45:27] (03CR) 10CDanis: [C:03+2] geomaps: send traffic intended for DRMRS to EQIAD [dns] - 10https://gerrit.wikimedia.org/r/1216810 (owner: 10Giuseppe Lavagetto) [15:45:50] !log cdanis@dns3003 START - running authdns-update [15:47:04] !log cdanis@dns3003 END - running authdns-update [15:48:00] (03PS1) 10Cathal Mooney: Nokia disabled port speeds setting - only set on D2L model [homer/public] - 10https://gerrit.wikimedia.org/r/1216822 (https://phabricator.wikimedia.org/T409178) [15:59:16] (03PS1) 10Ryan Kemper: ryankemper: fido-based ssh access [puppet] - 10https://gerrit.wikimedia.org/r/1216823 (https://phabricator.wikimedia.org/T412126) [16:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for SRE Collaboration Services office hours . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1600). 
[16:00:42] (03PS1) 10Cathal Mooney: Nokia sflow: disable on spine switch access/trunk ports [homer/public] - 10https://gerrit.wikimedia.org/r/1216824 [16:02:15] (03PS1) 10Ryan Kemper: wdqs: correct deploy tag and add codfw as site [alerts] - 10https://gerrit.wikimedia.org/r/1216825 (https://phabricator.wikimedia.org/T389859) [16:10:02] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:12:08] (03PS1) 10Klausman: aptrepo: Add ROCm 6.1 package resources to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1216826 [16:13:03] (03PS2) 10Klausman: aptrepo: Add ROCm 6.1 package resources to trixie [puppet] - 10https://gerrit.wikimedia.org/r/1216826 [16:14:57] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:20:39] !log Remove varnishkafka from trixie-wikimedia - T401832 [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:42] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [16:33:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:35:54] (03CR) 10Ayounsi: [C:03+1] Nokia disabled port speeds setting - only set on D2L model [homer/public] - 10https://gerrit.wikimedia.org/r/1216822 (https://phabricator.wikimedia.org/T409178) (owner: 10Cathal Mooney) [16:38:22] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 
10https://gerrit.wikimedia.org/r/1216680 (owner: 10Ncmonitor) [16:38:57] !log brett@dns1006 START - running authdns-update [16:39:52] !log brett@dns1006 END - running authdns-update [16:43:27] (03PS3) 10Klausman: aptrepo: Expand ROCm 7.0.2 packagelist to full set [puppet] - 10https://gerrit.wikimedia.org/r/1216826 [16:45:50] (03PS1) 10Brouberol: dse-k8s-codfw: enable pod-to-pod traffic cluster-wide [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216832 (https://phabricator.wikimedia.org/T408643) [16:53:27] (03CR) 10Cathal Mooney: [C:03+2] Nokia disabled port speeds setting - only set on D2L model [homer/public] - 10https://gerrit.wikimedia.org/r/1216822 (https://phabricator.wikimedia.org/T409178) (owner: 10Cathal Mooney) [16:54:52] (03Merged) 10jenkins-bot: Nokia disabled port speeds setting - only set on D2L model [homer/public] - 10https://gerrit.wikimedia.org/r/1216822 (https://phabricator.wikimedia.org/T409178) (owner: 10Cathal Mooney) [16:55:52] (03CR) 10Ssingh: C:mtail update trafficserver_backend_requests_seconds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [16:57:35] (03PS4) 10Klausman: homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832 [17:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:04:38] (03CR) 10RLazarus: [C:03+1] homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832 (owner: 10Klausman) [17:04:53] (03CR) 10Klausman: [C:03+2] homer/puppetmaster: Make sure the commitmsg hook does not double-add user [puppet] - 10https://gerrit.wikimedia.org/r/1204832 (owner: 10Klausman) [17:06:20] FIRING: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:18:19] (03CR) 10Bking: [C:03+2] dse-k8s-codfw: enable pod-to-pod traffic cluster-wide [deployment-charts] - 10https://gerrit.wikimedia.org/r/1216832 (https://phabricator.wikimedia.org/T408643) (owner: 10Brouberol) [17:18:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:18:21] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:19:35] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [17:19:57] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[17:20:27] 10SRE-Access-Requests, 13Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11444711 (10A_smart_kitten) [17:31:09] 06SRE: Getting forbidden from public CI runners on forgejo with opendatasync - https://phabricator.wikimedia.org/T412142 (10EvanCarroll) 03NEW [17:37:41] RESOLVED: PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-eqiad and (null) (10.195.0.249) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=pfw1-eqiad:9804&var-bgp_group=VPN&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [17:41:26] 06SRE, 06Traffic: Getting forbidden from public CI runners on forgejo with opendatasync - https://phabricator.wikimedia.org/T412142#11444796 (10Reedy) [17:42:28] (03CR) 10Dzahn: [C:03+1] "we also need to add donate.wikipedia25.org here" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:43:21] (03CR) 10CDobbins: [V:03+2] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [17:43:54] (03CR) 10CDobbins: [V:03+2 C:03+2] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [17:49:06] (03Merged) 10jenkins-bot: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [17:58:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11444901 (10KFrancis) Hi all, confirming the NDA is complete! Thanks! 
[17:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251209T1800) [18:02:31] (03PS1) 10Brouberol: Growthbook: setup OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215525 (https://phabricator.wikimedia.org/T411752) [18:02:32] 06SRE, 06Traffic: Getting forbidden from public CI runners on forgejo with opendatasync - https://phabricator.wikimedia.org/T412142#11444919 (10taavi) Please provide the full HTTP response body, which either details why the request is blocked or contains an internal identifier for us to locate the relevant WAF... [18:02:41] (03PS4) 10Dzahn: miscweb: add wikipedia25.org to extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) [18:04:43] (03CR) 10Dzahn: [C:03+2] miscweb: add wikipedia25.org to extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [18:05:53] (03CR) 10Superpes15: [ukwiki] Limit thanks for newbies to 3 per hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [18:11:28] (03PS2) 10Superpes15: [enwikibooks] Allow sysops to revert abusefilter and grant/revoke some flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215625 (https://phabricator.wikimedia.org/T411828) [18:12:00] (03Merged) 10jenkins-bot: miscweb: add wikipedia25.org to extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [18:13:26] (03PS1) 10LorenMora: [Legal Footer] Deploy Legal Footer for Phase 
1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216839 (https://phabricator.wikimedia.org/T410164) [18:21:27] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11444981 (10KFrancis) Hi all, I have sent the NDA out to be signed. I'll confirm when it's complete. [18:21:45] !log dzahn@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:22:16] !log dzahn@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:22:34] !log dzahn@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:23:12] !log dzahn@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [18:24:05] !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:24:28] !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:24:36] !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:25:07] !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[18:25:13] (03CR) 10Dzahn: [C:03+2] "I deployed this following https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service#Deploy_changes_to_helmfile.d/admin_ng" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [18:27:50] (03CR) 10Ayounsi: [C:03+1] Nokia sflow: disable on spine switch access/trunk ports [homer/public] - 10https://gerrit.wikimedia.org/r/1216824 (owner: 10Cathal Mooney) [18:28:08] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://docs.trafficserver.apache.org/en/9.0.x/admin-guide/files/sni.yaml.en.html" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [18:28:31] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics_privatedata_users and Superset for Solenne_Lazare_WMDE - https://phabricator.wikimedia.org/T411977#11444989 (10Solenne_Lazare_WMDE) @jcrespo Done for the LDAP profile added to my phabircator profile @KFrancis also just sig... [18:32:36] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 17 hosts with reason: upgradiing sr-linux on Nokia switches codfw [18:32:44] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11445009 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2a98251c-6798-469c-a3de-57fcfb13969f) set by cmooney@cumin1003 for 2:00:00 on 17 host(s) and t... [18:36:43] 06SRE, 06Traffic: Getting forbidden from public CI runners on forgejo with opendatasync - https://phabricator.wikimedia.org/T412142#11445020 (10EvanCarroll) `

Our servers are currently under maintenance or experiencing a technical issue