[00:05:17] (03PS1) 10Chlod Alejandro: tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) [00:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183751 [00:08:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183751 (owner: 10TrainBranchBot) [00:15:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137589 (10phaultfinder) [00:30:34] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183751 (owner: 10TrainBranchBot) [00:31:46] FIRING: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:36:46] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:38:58] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137605 (10phaultfinder) [00:41:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [00:43:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137606 (10phaultfinder) [00:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137607 (10phaultfinder) [00:51:46] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - 
https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:56:46] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:00:53] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:07:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378) [01:07:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [01:12:48] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 55s) [01:22:50] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-aux_30443: Servers aux-k8s-worker1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:23:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:25:01] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [01:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:59:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2231.codfw.wmnet with reason: Maintenance [01:59:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82357 and previous config saved to /var/cache/conftool/dbconfig/20250902-015927-ladsgroup.json [01:59:31] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0200) [02:27:36] (03PS1) 10DDesouza: Pre-deploy Newcomers survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) [02:29:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [02:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [02:39:59] 10ops-codfw, 
06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137669 (10phaultfinder) [02:57:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82358 and previous config saved to /var/cache/conftool/dbconfig/20250902-025704-ladsgroup.json [02:57:07] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [02:59:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137676 (10phaultfinder) [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0300) [03:02:01] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378) [03:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [03:02:53] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [03:03:17] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.17 refs T396378 [03:03:20] T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378 [03:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:12:12] !log ladsgroup@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P82359 and previous config saved to /var/cache/conftool/dbconfig/20250902-031211-ladsgroup.json [03:27:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P82360 and previous config saved to /var/cache/conftool/dbconfig/20250902-032719-ladsgroup.json [03:42:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82361 and previous config saved to /var/cache/conftool/dbconfig/20250902-034226-ladsgroup.json [03:42:30] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [03:47:08] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.17 refs T396378 (duration: 43m 50s) [03:47:11] T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0400) [04:01:14] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.14 (duration: 01m 04s) [04:03:54] (03CR) 10Anzx: [C:03+1] tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [04:10:47] (03CR) 10Anzx: [C:03+1] "Please add comment_1_5x and comment_2x with task ID, It's good to associate the logo change with the specific task it relates to." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [04:11:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:25:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:25:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:26:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:39:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [04:43:54] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137713 (10phaultfinder) [04:48:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137716 (10phaultfinder) [05:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:07:47] (03PS1) 10KartikMistry: cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982) [05:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:21:22] RECOVERY - mysqld processes on es2026 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:22:24] RECOVERY - MariaDB read only es2 on es2026 is OK: Version 10.11.13-MariaDB-log, Uptime 66s, read_only: True, event_scheduler: True, 11.30 QPS, connection latency: 0.021679s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:25:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2026 gradually with 4 steps - Pool es2026.codfw.wmnet in after cloning [05:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:31:12] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0600). 
[06:04:21] (03CR) 10Arnaudb: [C:03+2] "sampling traffic to confirm analysis in T402611#11131164" [puppet] - 10https://gerrit.wikimedia.org/r/1183698 (owner: 10Arnaudb) [06:04:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:08:48] (03PS1) 10Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1183766 [06:10:34] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2026 gradually with 4 steps - Pool es2026.codfw.wmnet in after cloning [06:10:35] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2026.codfw.wmnet onto es2049.codfw.wmnet [06:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:28:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:31:10] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:32:54] FIRING: 
CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:34:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183707 (https://phabricator.wikimedia.org/T403154) (owner: 10Filippo Giunchedi) [06:37:45] FIRING: [3x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:41:47] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137799 (10MoritzMuehlenhoff) All five replicas on maps-test have been re-synched and the Postgres log files look good now. [06:42:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:46:50] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for esams03 - jmm@cumin2002" [06:46:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137802 (10MoritzMuehlenhoff) [06:47:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for esams03 - jmm@cumin2002" [06:47:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:49:35] (03CR) 10Muehlenhoff: [C:03+2] Assign ganeti_routed role to ganeti3006 and configure cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183704 (owner: 10Muehlenhoff) [06:49:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - 
https://phabricator.wikimedia.org/T402259#11137808 (10ayounsi) [06:50:35] (03CR) 10Elukey: [C:03+2] profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [06:50:45] (03CR) 10Elukey: [C:03+2] Delete profile::python38 [puppet] - 10https://gerrit.wikimedia.org/r/1183680 (owner: 10Elukey) [06:51:01] (03CR) 10Elukey: [V:03+1 C:03+2] Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [06:51:10] (03PS6) 10Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [06:52:45] FIRING: [3x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:52:46] (03CR) 10Elukey: [C:03+2] Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [06:53:50] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:54:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:57:45] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [07:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0700). [07:00:05] hueitan and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:19] o/ [07:00:22] here [07:00:34] I'll start with hueitan's change.. [07:00:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:03:17] (03Merged) 10jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:03:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [07:04:00] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] [07:04:03] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:04:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [07:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:05:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [07:10:28] !log kartik@deploy1003 hueitan, kartik: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:10:31] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:11:05] hueitan: you can test the patch on wmf.16 Wikis. [07:13:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [07:13:28] !log kartik@deploy1003 hueitan, kartik: Continuing with sync [07:14:17] (03PS1) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) [07:14:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:19:34] !log create ganeti03 cluster T402259 [07:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:37] T402259: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259 [07:20:44] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 16m 43s) [07:20:47] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:21:26] hueitan: I'll deploy second patch once CI is passed. 
[07:25:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:25:53] (03CR) 10Arnaudb: [C:03+2] Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1183766 (owner: 10Arnaudb) [07:26:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137841 (10MoritzMuehlenhoff) [07:27:07] (03Merged) 10jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:27:36] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] [07:27:39] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:33:02] jouncebot: nowandnext [07:33:02] For the next 0 hour(s) and 26 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0700) [07:33:03] In 2 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1000) [07:33:30] !log kartik@deploy1003 hueitan, kartik: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:33:34] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:34:09] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile enable datacenter timer [puppet] - 10https://gerrit.wikimedia.org/r/1183612 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:36:16] !log kartik@deploy1003 hueitan, kartik: Continuing with sync [07:37:14] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137855 (10TheDJ) There should probably be an alert for `could not receive data from WAL stream`.. there's at least 3 old closed tickets in phab with exactly the same log line and s... [07:39:13] (03PS1) 10Muehlenhoff: Add esams03 to Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259) [07:39:58] (03PS1) 10Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183979 [07:41:14] (03CR) 10Ayounsi: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:41:35] Could I slot in for a deploy after this one? [07:41:48] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 14m 12s) [07:41:52] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:42:10] Mvolz sure, we almost done. 
[07:42:20] kart_ will let us know [07:42:23] cool [07:42:36] (03PS2) 10Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183979 [07:42:48] (03PS3) 10Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183979 [07:43:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137859 (10MoritzMuehlenhoff) >>! In T381565#11137855, @TheDJ wrote: > There should probably be an alert for `could not receive data from WAL stream`.. there's at least 3 old closed... [07:43:40] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:43:41] (03CR) 10Muehlenhoff: [C:03+2] Add esams03 to Netbox sync [puppet] - 10https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:47:17] (03CR) 10Vgutierrez: [C:04-1] "current approach will alter our metrics, please track wdqs as `ua_policy:wdqs` (same as in Varnish)" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (owner: 10Slyngshede) [07:48:40] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:28] kart_: looks like scap is done? Would it be okay for me to start my config change? 
[07:53:40] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:54:44] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [07:58:40] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:59:31] Mvolz: sorry, missed msg [07:59:36] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:59:38] Mvolz: Please go ahead. 
[08:03:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:47] (03PS2) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [08:04:50] (03CR) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [08:05:40] (03PS1) 10Muehlenhoff: Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259) [08:12:21] eh, window's over, I'll do it some other time. [08:17:02] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11137889 (10JMeybohm) >> Also these IPs/hosts might and will change in the future so they would have to be updated regularly. > > How often might that happen? I can... 
[08:17:39] (03CR) 10Filippo Giunchedi: [C:03+2] java: add support for Trixie / Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1183707 (https://phabricator.wikimedia.org/T403154) (owner: 10Filippo Giunchedi) [08:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:24:26] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:10] (03PS2) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1183975 (https://phabricator.wikimedia.org/T402611) [08:25:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:25:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:26:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:27:35] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - 
https://phabricator.wikimedia.org/T403298#11137915 (10MoritzMuehlenhoff) >>! In T403298#11137889, @JMeybohm wrote: >> How often might that happen? > I can't say for sure. Definitely for every Debian OS versi... [08:40:05] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1184032 [08:40:49] FIRING: PuppetFailure: Puppet has failed on ml-serve1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:42:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:43:18] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1184032 (owner: 10Arnaudb) [08:43:29] (03PS1) 10Muehlenhoff: Revert "Line-wrap Homer diffs" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 [08:47:01] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1083.eqiad.wmnet with OS bullseye [08:47:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet... 
[08:47:53] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1084.eqiad.wmnet with OS bullseye [08:48:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137956 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet... [08:48:24] (03CR) 10Ayounsi: [C:03+1] Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [08:48:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137957 (10phaultfinder) [08:50:47] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1085.eqiad.wmnet with OS bullseye [08:51:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet... [08:53:15] (03CR) 10Muehlenhoff: [C:03+2] Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [08:53:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137960 (10phaultfinder) [08:54:56] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[08:57:48] (03PS1) 10Elukey: apt: add the non-free-firmware component for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1184035 [09:00:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:01:27] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff) [09:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137967 (10ayounsi) [09:03:27] (03CR) 10Ayounsi: [C:03+1] "indeed, I missed that the issue is with rancid and not homer" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff) [09:05:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:06:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet [09:06:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:07:06] (03CR) 10Vgutierrez: P:cache::haproxy disallow Wikidata Query Service as UA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [09:07:19] (03PS1) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) [09:08:06] (03CR) 10CI reject: [V:04-1] 
P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:08:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1184035 (owner: 10Elukey) [09:08:54] (03CR) 10Muehlenhoff: [C:03+2] Revert "Line-wrap Homer diffs" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff) [09:10:30] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:10:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:10:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:10:36] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [09:10:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [09:10:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:10:56] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage [09:11:33] (03CR) 10Elukey: [C:03+2] apt: add the non-free-firmware component for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1184035 (owner: 10Elukey) [09:13:17] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage [09:14:17] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage [09:14:45] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM 
ncredir3005.esams.wmnet - jmm@cumin2002" [09:14:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:14:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:51] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [09:14:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [09:14:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet [09:15:18] (03PS2) 10Chlod Alejandro: tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) [09:15:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet [09:15:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:16:00] (03CR) 10Chlod Alejandro: "Done! Didn't know the `comment` doesn't automatically add the comments for those two in. Perhaps another change can be made for that in `m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [09:17:56] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage [09:19:05] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11138061 (10elukey) I tested the following and I see the correct image: ` curl -s "https://kartotherian.svc.codfw.wmnet:6543/img/osm-intl,14,a,a,300x200.png?lang=en&domain=en.wikiped... 
[09:19:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:19:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:19:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:19:35] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [09:19:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [09:19:53] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:21:20] (03PS1) 10Tiziano Fogli: MysqlSustainedReplLag: replace Icinga-based PromQL checks [alerts] - 10https://gerrit.wikimedia.org/r/1184039 (https://phabricator.wikimedia.org/T315866) [09:23:17] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:23:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [09:23:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:23:24] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [09:23:26] (03CR) 10Anzx: [C:03+1] tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro) [09:23:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [09:23:31] !log jmm@cumin2002 END (FAIL) - Cookbook 
sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet [09:23:44] !log mvernon@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1083.eqiad.wmnet with OS bullseye [09:23:47] PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet wit... [09:24:24] (03CR) 10Brouberol: [C:03+1] Use the standby analytics_meta mariadb server temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1183642 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis) [09:24:28] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1083.eqiad.wmnet with OS bullseye [09:24:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet... [09:25:15] RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [09:25:48] RESOLVED: PuppetFailure: Puppet has failed on ml-serve1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:26:26] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [09:27:00] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [09:27:05] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [09:27:56] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. 
[09:28:37] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [09:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:29:47] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [09:29:55] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [09:31:17] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1084.eqiad.wmnet with OS bullseye [09:31:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet wit... 
[09:31:51] (03PS2) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) [09:32:18] (03CR) 10CI reject: [V:04-1] P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:33:11] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:33:42] (03PS3) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) [09:33:54] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1085.eqiad.wmnet with OS bullseye [09:34:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet wit... [09:37:06] (03PS1) 10Filippo Giunchedi: openstack: add wmcs-server-id [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) [09:38:36] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. 
[09:40:22] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6826/co" [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:40:40] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:42:13] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11138109 (10elukey) @TheDJ Hi! As FYI I just repooled maps codfw, we don't see anymore issues but please let us know if you see anything weird. Thanks! [09:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:58:18] (03PS1) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) [09:58:21] (03PS1) 10Btullis: Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) [09:59:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1000) [10:02:57] 06SRE: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452 (10Aklapper) 03NEW [10:04:43] PROBLEM - MariaDB Replica IO: analytics_meta on db1208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event analytics-meta-bin.000197 at 258619100, the last event read from analytics-meta-bin.000271 at 667686198, the last byte read from analytics-meta-bin.000271 at 667686229. https://wikitech.wikimedia.or [10:04:43] ariaDB/troubleshooting%23Depooling_a_replica [10:06:23] (03CR) 10Btullis: [C:03+2] Use the standby analytics_meta mariadb server temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1183642 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis) [10:10:49] (03CR) 10Ilias Sarantopoulos: "Thanks for spotting this! can you also add the change to the experimental namespace?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [10:11:22] (03PS1) 10Aklapper: offboard-user: Remove "Security" from privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044 [10:11:56] (03CR) 10Btullis: [C:03+2] Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis) [10:12:39] (03CR) 10Stevemunene: [C:03+1] Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis) [10:13:43] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11138223 (10VRiley-WMF) |Device A|Device A Port|Device B|Device B Port|Type|Notes|Length required| |----------|-----------------|----------|----------|-------|-----|-------------... 
[10:13:58] (03Merged) 10jenkins-bot: Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis) [10:14:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [10:15:18] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage [10:15:38] (03PS1) 10Vgutierrez: hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400119) [10:16:42] (03PS2) 10Vgutierrez: hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) [10:16:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [10:17:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [10:17:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [10:19:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:19:12] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage [10:19:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:21:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:21:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82368 and previous config saved to /var/cache/conftool/dbconfig/20250902-102130-fceratto.json [10:21:36] T401906: Add default value for 
afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:21:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [10:21:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [10:22:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [10:23:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82369 and previous config saved to /var/cache/conftool/dbconfig/20250902-102353-fceratto.json [10:24:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [10:25:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [10:28:05] (03CR) 10Hnowlan: [C:03+2] rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [10:28:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [10:29:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [10:30:09] (03Merged) 10jenkins-bot: rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [10:30:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [10:31:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet [10:31:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power 
autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:34:26] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:34] (03CR) 10Fabfur: [C:03+1] hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [10:34:35] (03PS2) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) [10:34:43] RECOVERY - MariaDB Replica IO: analytics_meta on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:35:26] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:35:32] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [10:35:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [10:35:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:35:52] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host ms-be1083.eqiad.wmnet with OS bullseye [10:36:06] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:36:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138352 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet wit... [10:36:13] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:36:34] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:36:47] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:37:06] (03CR) 10FNegri: openstack: add wmcs-server-id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi) [10:37:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:37:30] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:38:29] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [10:38:48] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[10:39:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82370 and previous config saved to /var/cache/conftool/dbconfig/20250902-103901-fceratto.json [10:40:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:40:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:40:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:15] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [10:40:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [10:40:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet [10:40:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet [10:40:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:43:33] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [10:43:42] (03CR) 10Ilias Sarantopoulos: "One comment on the maxreplicas for experimental ns, other than that looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [10:44:26] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! I think the dict is right based on what's on the boxes." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:44:26] FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:44:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:44:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:45:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:45:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:45:03] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors [10:45:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors [10:45:11] (03CR) 10Cathal Mooney: "Actually I notice one nit in-line" [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:45:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:45:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3005.esams.wmnet - jmm@cumin2002" [10:46:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host 
ncredir3005.esams.wmnet with OS bookworm [10:46:39] (03PS3) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [10:46:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir3005.esams.wmnet with OS bookworm [10:47:40] 06SRE, 06Traffic, 10Wikidata, 10Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11138378 (10gmodena) >>! In T402959#11132802, @CDanis wrote: > Hi @Lydia_Pintscher , SRE can make some exceptio... [10:47:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6828/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [10:49:18] (03CR) 10Cathal Mooney: Nokia: /routing-policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:49:36] (03CR) 10Vgutierrez: P:cache::haproxy disallow Wikidata Query Service as UA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [10:49:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:51:03] (03PS4) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [10:51:47] (03CR) 10MVernon: [C:03+2] swift: re-add 
3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:51:52] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6829/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [10:52:02] (03PS3) 10MVernon: swift: re-add 3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) [10:53:26] (03PS3) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) [10:53:28] (03CR) 10MVernon: [C:03+2] swift: re-add 3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:54:01] (03PS5) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [10:54:03] (03CR) 10Cathal Mooney: Nokia: /routing-policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:54:03] (03CR) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [10:54:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82371 and previous config saved to /var/cache/conftool/dbconfig/20250902-105411-fceratto.json [10:54:50] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6830/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [10:58:50] (03PS6) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [10:58:53] (03CR) 10Ayounsi: Nokia: /routing-policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [10:59:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138396 (10MatthewVernon) [10:59:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6831/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [11:00:57] (03PS7) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) [11:02:19] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6832/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 
10Slyngshede) [11:02:25] (03CR) 10Cathal Mooney: [C:03+1] "Yep no problem with this +1." [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 (owner: 10Ayounsi) [11:04:22] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:08:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3005.esams.wmnet with reason: host reimage [11:09:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82372 and previous config saved to /var/cache/conftool/dbconfig/20250902-110919-fceratto.json [11:09:23] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:09:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:09:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82373 and previous config saved to /var/cache/conftool/dbconfig/20250902-110942-fceratto.json [11:11:24] (03CR) 10Muehlenhoff: [C:03+1] "Looking at the current stats of maps-test2002 we're having 50ish connections. 
But raising this surely can't hurt either, the new maps node" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [11:12:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82374 and previous config saved to /var/cache/conftool/dbconfig/20250902-111203-fceratto.json [11:12:11] (03PS2) 10Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) [11:13:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3005.esams.wmnet with reason: host reimage [11:15:33] (03CR) 10Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:23:40] (03CR) 10Slyngshede: [V:03+1] P:cache::haproxy disallow Wikidata Query Service as UA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [11:24:56] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Great, let's go!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [11:25:25] (03CR) 10Ayounsi: JunOS IBGP: adjust template to work with updated data from plugin (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:26:15] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [11:27:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82375 and previous config saved to /var/cache/conftool/dbconfig/20250902-112711-fceratto.json [11:28:07] (03CR) 10Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [11:29:07] (03CR) 10Gkyziridis: [C:03+2] ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [11:29:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3005.esams.wmnet with OS bookworm [11:29:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3005.esams.wmnet [11:30:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138474 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir3005.esams.wmnet with OS bookworm completed: - ncredi... [11:31:20] (03Merged) 10jenkins-bot: ml-services: Fix KServe batcher setup for edit-check. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis) [11:32:46] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:34:01] (03CR) 10Ayounsi: [C:03+2] "We could still make it configurable when needed. I just worry that too many nested dicts makes it too complex in the longer run." [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 (owner: 10Ayounsi) [11:35:22] (03Merged) 10jenkins-bot: Nokia OSPF: different proposal [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 (owner: 10Ayounsi) [11:37:19] jouncebot: nowandnext [11:37:19] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [11:37:19] In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1200) [11:37:40] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [11:38:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3005.esams.wmnet [11:38:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:39:56] Deploying cxserver. Staging only change. [11:40:20] Emperor: I'm about to do the switchover of s1, setting all of English Wikipedia to read only, and the best part: the script broke last time I did it, so it might take longer as I might have to do a lot of stuff manually while the whole site is RO.
[11:40:27] (03CR) 10KartikMistry: [C:03+2] cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [11:42:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3005.esams.wmnet - jmm@cumin2002" [11:42:05] (03Merged) 10jenkins-bot: cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982) (owner: 10KartikMistry) [11:42:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82376 and previous config saved to /var/cache/conftool/dbconfig/20250902-114219-fceratto.json [11:43:01] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [11:43:24] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:43:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T402870 [11:43:43] T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870 [11:44:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1184 with weight 0 T402870', diff saved to https://phabricator.wikimedia.org/P82377 and previous config saved to /var/cache/conftool/dbconfig/20250902-114408-ladsgroup.json [11:44:43] (03PS1) 10Muehlenhoff: Add ncredir3005 as ncredir node [puppet] - 10https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259) [11:44:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3005.esams.wmnet - jmm@cumin2002" [11:44:54] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:44:54] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3005.esams.wmnet on all recursors [11:44:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3005.esams.wmnet on all recursors [11:45:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3005.esams.wmnet - jmm@cumin2002" [11:45:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3005.esams.wmnet - jmm@cumin2002" [11:48:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3005.esams.wmnet with OS bookworm [11:49:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138559 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host durum3005.esams.wmnet with OS bookworm [11:52:46] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:54:15] (03CR) 10Vgutierrez: [C:03+1] Add ncredir3005 as ncredir node [puppet] - 10https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [11:57:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T401906)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250902-115727-fceratto.json [11:57:47] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:57:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
4:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:57:48] (03PS3) 10Anzx: idwiki: Add extended confirmed usergroup & restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) [11:57:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82380 and previous config saved to /var/cache/conftool/dbconfig/20250902-115754-fceratto.json [11:58:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: 10Anzx) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1200) [12:00:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82381 and previous config saved to /var/cache/conftool/dbconfig/20250902-120020-fceratto.json [12:00:32] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:00:36] (03CR) 10Muehlenhoff: [C:03+2] Add ncredir3005 as ncredir node [puppet] - 10https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:01:35] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:01:48] (03CR) 10Elukey: [V:03+1] "At the moment codfw takes a lot less traffic than eqiad, they are very imbalanced. 
My main concern is for when a single cluster needs to a" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:01:54] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:02:34] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:03:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138603 (10elukey) 05Resolved→03Open @Jclark-ctr Hi! I noticed that console redir seems not working for ml-serve1013 (but it works for 1012), and the bios settings... [12:04:03] Amir1: ack, good luck... [12:06:54] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11138642 (10elukey) @Mvolz Hi! Sorry for the delay! Prometheus metrics cannot be filtered... [12:08:30] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [12:08:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3005.esams.wmnet with reason: host reimage [12:10:57] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:11:41] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:13:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3005.esams.wmnet with reason: host reimage [12:13:55] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870) [12:13:59] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870) (owner: 10Gerrit maintenance bot) [12:15:29] !log Starting s1 eqiad failover from db1163 to db1184 - T402870 [12:15:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82382 and previous config saved to /var/cache/conftool/dbconfig/20250902-121531-fceratto.json [12:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:32] T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870 [12:15:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T402870', diff saved to https://phabricator.wikimedia.org/P82383 and previous config saved to /var/cache/conftool/dbconfig/20250902-121548-ladsgroup.json [12:15:52] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:16:39] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:18:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T402870', diff saved to https://phabricator.wikimedia.org/P82384 and previous config saved to /var/cache/conftool/dbconfig/20250902-121814-ladsgroup.json [12:20:15] (03PS2) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) [12:20:16] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: 10Gerrit maintenance bot) [12:20:18] (03CR) 10Ladsgroup: [V:03+2 C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: 10Gerrit maintenance bot) [12:20:32] !log ladsgroup@dns1004 START - running authdns-update [12:21:33] !log ladsgroup@dns1004 END - running authdns-update [12:21:49] (03PS1) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) [12:22:29] (03CR) 10Dreamy Jazz: [C:04-1] "We need to decide on which wikis we will create these tables on and then create them on production before we merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [12:23:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1163 T402870', diff saved to https://phabricator.wikimedia.org/P82385 and previous config saved to /var/cache/conftool/dbconfig/20250902-122310-ladsgroup.json [12:23:18] T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870 [12:24:21] (03CR) 10CI reject: [V:04-1] tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [12:24:52] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir3005.esams.wmnet [12:25:10] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir3005.esams.wmnet [12:25:16] (03PS1) 10Stevemunene: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) [12:25:17] (03PS1) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) [12:25:20] (03PS1) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) [12:25:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:25:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:26:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:27:05] (03PS1) 10Muehlenhoff: Remove ncredir3003 [puppet] - 10https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259) [12:28:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1163.eqiad.wmnet with reason: Old primary of s1 [12:28:22] (03PS2) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) [12:28:34] (03CR) 10Dreamy Jazz: [C:04-1] tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz) [12:29:10] Emperor: I'll be running an upgrade cookbook that sometimes triggers a page on db1163. I'm so sorry for this mess but if you get a page for db1163, please ignore [12:29:19] (it removes the downtime when it shouldn't) [12:29:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3005.esams.wmnet with OS bookworm [12:29:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3005.esams.wmnet [12:30:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host durum3005.esams.wmnet with OS bookworm completed: - durum300... 
[12:30:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82387 and previous config saved to /var/cache/conftool/dbconfig/20250902-123038-fceratto.json [12:31:15] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-mariadb1001.eqiad.wmnet [12:31:36] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db1163.eqiad.wmnet [12:31:44] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1163 - Upgrading db1163.eqiad.wmnet [12:31:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1163 - Upgrading db1163.eqiad.wmnet [12:32:35] (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:32:58] (03CR) 10CI reject: [V:04-1] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:35:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1001.eqiad.wmnet [12:38:03] (03PS1) 10Elukey: admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) [12:38:50] (03PS1) 10Brouberol: airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) [12:39:58] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol) [12:40:25] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp [12:40:39] 
(03PS2) 10Brouberol: airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) [12:42:47] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol) [12:43:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1163.eqiad.wmnet [12:44:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1001.eqiad.wmnet [12:44:39] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-mariadb1001.eqiad.wmnet [12:44:41] (03CR) 10CI reject: [V:04-1] admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey) [12:44:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403431#11138809 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:45:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:45:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82388 and previous config saved to /var/cache/conftool/dbconfig/20250902-124545-fceratto.json [12:45:49] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:46:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:46:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2182.codfw.wmnet with reason: Maintenance [12:46:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82389 and previous config saved to /var/cache/conftool/dbconfig/20250902-124608-fceratto.json [12:47:16] (03PS2) 10Elukey: admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) [12:48:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82390 and previous config saved to /var/cache/conftool/dbconfig/20250902-124830-fceratto.json [12:51:05] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1163.eqiad.wmnet with reason: Old primary of s1 [12:53:55] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11138841 (10phaultfinder) [12:56:38] (03PS2) 10Stevemunene: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) [12:56:38] (03PS2) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) [12:56:38] (03PS2) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) [12:57:41] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey) [12:58:36] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3003.esams.wmnet [12:58:55] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11138862 (10phaultfinder) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1300). [13:00:05] Tran, JustHannah, kart_, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3005.wikimedia.org [13:00:12] o/ [13:00:14] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:00:20] here [13:00:44] I’m here but don’t really have time for deploying right now :/ [13:00:51] I can self deploy Nik's patch. [13:00:54] I can deploy my own [13:01:02] Go ahead Tran [13:01:13] Thanks. Going to start - two going in at the same time as one is a no-op comment update. [13:01:37] Lucas_WMDE: I can deploy if needed. [13:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:59] (03PS1) 10Btullis: Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438) [13:02:00] JustHannah and anzx let me know if you need help in deployment [13:02:13] I need help, please! 
[13:02:45] (03PS1) 10Muehlenhoff: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 [13:02:51] (03CR) 10Elukey: [C:03+2] admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey) [13:03:12] (03CR) 10CI reject: [V:04-1] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff) [13:03:13] (03CR) 10Ssingh: [C:03+1] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff) [13:03:15] kart_: i also need someone to deploy [13:03:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders) [13:03:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [13:03:30] (03CR) 10Ssingh: [C:03+1] Apply the durum role on durum3005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff) [13:03:37] kart_: +1 [13:03:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82391 and previous config saved to /var/cache/conftool/dbconfig/20250902-130338-fceratto.json [13:03:45] (03CR) 10CI reject: [V:04-1] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [13:03:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3005.wikimedia.org - jmm@cumin2002" [13:04:18] 
(03Merged) 10jenkins-bot: Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders) [13:04:27] (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [13:04:29] (03Merged) 10jenkins-bot: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [13:04:30] (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [13:04:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11138880 (10ssingh) >>! In T300877#11130890, @ayounsi wrote: >> the idea is that static routes should help save us in that situation > > That would only... [13:04:59] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]] [13:05:05] T396217: Document that groups with IP reveal rights must not be changed without making changes to the cache for Special:GlobalContributions - https://phabricator.wikimedia.org/T396217 [13:05:05] T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181 [13:05:20] (03CR) 10Muehlenhoff: [C:03+1] "Right now with maps-test serving all traffic, we have 87 connections, but I fully agree, we have the capacity and let's use it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [13:05:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3005.wikimedia.org - jmm@cumin2002" [13:05:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:33] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh3005.wikimedia.org on all recursors [13:05:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3005.wikimedia.org on all recursors [13:05:38] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm [13:05:46] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11138894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm [13:05:53] (03PS2) 10Muehlenhoff: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) [13:06:00] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3005.wikimedia.org - jmm@cumin2002" [13:06:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS trixie [13:06:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3005.wikimedia.org - jmm@cumin2002" [13:06:08] (03CR) 10Ssingh: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:07:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: 
Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138908 (10Jclark-ctr) @elukey Confirmed same issue; connected to iDRAC via SSH tunnel, logged in, and reset BMC under Maintenance → BMC Reset → Selected Unit Reset. i... [13:07:39] Sure. I can deploy JustHannah anzx [13:07:45] !log install libpython3.9-dbg python3.9-dbg on ms-fe2016 for debugging [13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh3005.wikimedia.org with OS bookworm [13:08:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh3005.wikimedia.org with OS bookworm [13:09:13] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS trixie [13:10:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138918 (10MoritzMuehlenhoff) [13:10:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:10:12] (03PS1) 10Btullis: Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) [13:10:12] (03CR) 10Muehlenhoff: Apply the durum role on durum3005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:10:26] (03CR) 10Muehlenhoff: [C:03+2] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [13:11:05] (03CR) 10Brouberol: [C:03+1] Remove 
references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:11:47] !log stran@deploy1003 tchanders, stran: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:49] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T228380 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1184071 (https://phabricator.wikimedia.org/T395443) [13:11:51] (03CR) 10Brouberol: [C:03+1] Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:11:57] T396217: Document that groups with IP reveal rights must not be changed without making changes to the cache for Special:GlobalContributions - https://phabricator.wikimedia.org/T396217 [13:11:57] T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181 [13:12:22] Testing my patches now [13:12:52] (03CR) 10Btullis: [C:03+2] Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:13:27] (03CR) 10Filippo Giunchedi: openstack: add wmcs-server-id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi) [13:15:10] FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:15:17] ^ expected [13:16:53] !log stran@deploy1003 tchanders, stran: Continuing with sync [13:17:09] Done 
testing, finishing sync [13:18:46] (03PS9) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [13:18:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82392 and previous config saved to /var/cache/conftool/dbconfig/20250902-131845-fceratto.json [13:19:54] (03CR) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:22:12] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]] (duration: 17m 13s) [13:23:10] My deploy is done, thanks for your patience! [13:23:14] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [13:23:16] JustHannah: will start with your patch. [13:23:19] Tran: Thanks [13:23:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum) [13:24:09] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy2003.codfw.wmnet with reason: host reimage [13:24:12] kart_:okay! 
[13:24:48] (03Merged) 10jenkins-bot: Set $wgPHPSessionHandling to 'disable' on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum) [13:25:12] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]] [13:28:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3005.wikimedia.org with reason: host reimage [13:29:16] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T228380 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1184071 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:29:38] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [13:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:31:59] !log kartik@deploy1003 hokwelum, kartik: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:32:41] JustHannah: you can test patch now [13:33:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3005.wikimedia.org with reason: host reimage [13:33:12] okay! Thank you! [13:33:53] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [13:33:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82393 and previous config saved to /var/cache/conftool/dbconfig/20250902-133352-fceratto.json [13:34:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [13:34:21] (03PS1) 10Dreamy Jazz: Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) [13:34:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [13:35:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [13:35:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82394 and previous config saved to /var/cache/conftool/dbconfig/20250902-133513-fceratto.json [13:35:17] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.46% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:24] kart_: looks good! [13:36:31] cool. deploying.. [13:36:35] !log kartik@deploy1003 hokwelum, kartik: Continuing with sync [13:36:49] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy2003.codfw.wmnet with reason: host reimage [13:37:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82395 and previous config saved to /var/cache/conftool/dbconfig/20250902-133736-fceratto.json [13:38:05] o/ [13:38:12] I’d be available now if needed :) [13:39:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [13:40:37] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [13:42:15] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11139117 (10Gehel) p:05Triage→03High [13:42:16] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11139118 (10Gehel) p:05Triage→03High [13:42:26] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11139119 (10Gehel) p:05Triage→03High [13:42:36] 07sre-alert-triage, 06Data-Platform-SRE: Alert 
in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11139124 (10Gehel) p:05Triage→03High [13:42:41] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]] (duration: 17m 28s) [13:42:44] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:42:55] JustHannah: done. [13:43:01] anzx: your patch is next. [13:43:01] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11139132 (10Gehel) [13:43:07] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11139136 (10Gehel) [13:43:12] kart_: ok [13:43:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11139134 (10Gehel) [13:43:29] 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11139142 (10Gehel) [13:43:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: 10Anzx) [13:44:26] kart_: Thank you so much! 
[13:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:44:56] (03Merged) 10jenkins-bot: idwiki: Add extended confirmed usergroup & restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: 10Anzx) [13:45:19] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]] [13:45:22] T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755 [13:45:26] (03PS10) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [13:45:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS trixie [13:46:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:47:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3005.wikimedia.org with OS bookworm [13:47:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3005.wikimedia.org [13:48:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11139195 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh3005.wikimedia.org with OS bookworm completed: - doh3005... 
[13:48:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [13:48:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82396 and previous config saved to /var/cache/conftool/dbconfig/20250902-134854-ladsgroup.json [13:48:58] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:49:42] (03PS1) 10Muehlenhoff: Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259) [13:50:10] FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:50:23] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1004 to an-worker1233 [13:50:43] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:50:49] (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:50:55] (03CR) 10Stevemunene: [C:03+1] Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:52:15] !log kartik@deploy1003 kartik, anzx: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:52:18] T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755 [13:52:19] kart_: looks good, ok to sync [13:52:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P82397 and previous config saved to /var/cache/conftool/dbconfig/20250902-135243-fceratto.json [13:52:52] anzx: cool. That's fast. [13:52:58] !log kartik@deploy1003 kartik, anzx: Continuing with sync [13:53:22] yeah change was working more than two minutes ago [13:53:32] (03PS1) 10Bking: refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) [13:53:34] :) [13:54:51] (03Abandoned) 10Ayounsi: esams routed ganeti: add v4 and v6 IP/range [puppet] - 10https://gerrit.wikimedia.org/r/1180130 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [13:55:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [13:55:36] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [13:55:55] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [13:55:56] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy2003.codfw.wmnet with OS bookworm [13:55:58] (03CR) 10Xcollazo: [C:03+1] refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [13:56:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139244 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm completed: - deploy2003 (**PASS**) -... [13:56:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139247 (10Jhancock.wm) 05Open→03Resolved [13:56:30] btullis@cumin1003 rename (PID 688424) is awaiting input [13:56:46] (03PS1) 10Muehlenhoff: Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565) [13:57:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139253 (10Jhancock.wm) @Clement_Goubert @jasmine_ this is complete and ready for y'all! [13:57:36] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS trixie [13:57:49] (03CR) 10Volans: [C:03+1] "Looks ok but the amount of corner cases is becoming worrisome" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:58:10] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]] (duration: 12m 51s) [13:58:13] T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755 [13:58:37] I can self deploy my one [13:59:33] (03PS1) 10Muehlenhoff: Update partman config for the new maps nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565) [13:59:43] kart_: Are you going to deploy your change now? 
[14:00:04] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1400) [14:00:07] Dreamy_Jazz: yes [14:00:11] FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:00:28] 10SRE-swift-storage, 10Observability-Alerting: Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#11139266 (10tappof) [14:00:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [14:01:13] 07Puppet, 10MW-on-K8s, 10Observability-Alerting: Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#11139269 (10tappof) [14:01:45] (03CR) 10Elukey: "I was about to say that there is still a little nit to solve:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:02:20] (03Merged) 10jenkins-bot: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [14:02:24] kart_: thanks for deploying [14:02:44] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183703|ContentTranslation: Add cxserver host for server-side requests (T386131)]] [14:02:47] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131 [14:03:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:03:59] (03CR) 10Aklapper: [C:03+2] Remove fallback for Asturian language 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1183280 (https://phabricator.wikimedia.org/T292750) (owner: 10Pppery) [14:04:17] (03CR) 10Aklapper: [V:03+2 C:03+2] Remove fallback for Asturian language [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1183280 (https://phabricator.wikimedia.org/T292750) (owner: 10Pppery) [14:04:19] (03PS3) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:04:19] (03PS3) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:04:19] (03PS3) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:07:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P82398 and previous config saved to /var/cache/conftool/dbconfig/20250902-140751-fceratto.json [14:08:42] (03PS4) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:08:42] (03PS4) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:09:17] !log eqsin: remove lvs static routes - T300877 [14:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:20] T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 [14:09:35] !log kartik@deploy1003 kartik, 
ngkountas: Backport for [[gerrit:1183703|ContentTranslation: Add cxserver host for server-side requests (T386131)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:09:38] T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131 [14:10:08] (03PS5) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:10:08] (03PS5) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:12:03] (03PS6) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:12:03] (03PS6) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:12:50] kart_: just one question, even if I delete a saved or in-progress translation, it still appears when I check it again. [14:13:56] anzx: in CX? [14:14:10] yes [14:14:53] Need to check. Can you file a task with details? 
[14:15:18] kart_: yes, I'll file one tomorrow, thanks [14:15:24] !log ulsfo: remove lvs static routes - T300877 [14:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 [14:21:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11139452 (10elukey) 05Open→03Resolved @Jclark-ctr confirmed that it works, thanks a lot! [14:22:02] (03CR) 10Elukey: [C:03+1] Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:22:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82399 and previous config saved to /var/cache/conftool/dbconfig/20250902-142259-fceratto.json [14:23:03] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:23:07] (03CR) 10Elukey: [C:03+1] "Adding also Yiannis: we are going to use raid5 for the new maps nodes, it will give us more space if needed for the future. Raid 5 may be " [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:23:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:23:16] (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:23:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82400 and previous config saved to /var/cache/conftool/dbconfig/20250902-142322-fceratto.json [14:23:31] (03CR) 10Stevemunene: [C:03+1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:24:43] (03PS1) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184087 [14:25:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82401 and previous config saved to /var/cache/conftool/dbconfig/20250902-142545-fceratto.json [14:26:00] (03Abandoned) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184087 (owner: 10Mszwarc) [14:26:26] !log codfw: remove lvs static routes - T300877 [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:29] T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 [14:26:39] (03PS5) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:27:46] (03PS1) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089 [14:28:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: 
Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11139484 (10ayounsi) [14:28:27] (03CR) 10Bking: [C:03+2] refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [14:29:22] We're still testing the config patch.. [14:30:07] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1430) [14:31:05] (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis) [14:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:33:00] (03CR) 10Ayounsi: [C:03+1] JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:34:31] (03PS1) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) [14:34:37] (03PS1) 10Federico Ceratto: instances.yaml: Add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1184091 (https://phabricator.wikimedia.org/T402859) [14:40:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82402 and previous config saved to /var/cache/conftool/dbconfig/20250902-144053-fceratto.json [14:44:47] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:04] (03CR) 10STran: [C:03+1] Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089 (owner: 10Mszwarc) [14:47:22] kart_: Still testing? [14:47:45] Dreamy_Jazz: sadly, yes. [14:49:36] (03CR) 10Muehlenhoff: [C:03+2] Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:49:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:53:16] (03CR) 10Muehlenhoff: [C:03+2] Update partman config for the new maps nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:53:20] !log kartik@deploy1003 Sync cancelled. [14:53:51] (03PS3) 10Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) [14:53:51] (03PS1) 10KartikMistry: Revert "ContentTranslation: Add cxserver host for server-side requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096 [14:54:12] (03PS2) 10Cathal Mooney: WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) [14:54:20] Dreamy_Jazz: I've to revert as well. 
[14:54:25] Okay [14:54:44] Though that shouldn't need a full scap backport because the sync never went beyond the test servers [14:55:02] And I'll overwrite what is on those when I deploy my change [14:55:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096 (owner: 10KartikMistry) [14:55:29] (03CR) 10CI reject: [V:04-1] WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [14:56:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82403 and previous config saved to /var/cache/conftool/dbconfig/20250902-145601-fceratto.json [14:56:11] (03CR) 10Stevemunene: [C:03+1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [14:56:24] (03Merged) 10jenkins-bot: Revert "ContentTranslation: Add cxserver host for server-side requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096 (owner: 10KartikMistry) [14:56:48] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]] [15:00:05] jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1500). 
[15:03:06] (03CR) 10Dreamy Jazz: [C:03+2] Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [15:03:17] (03CR) 10JHathaway: [C:03+2] provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [15:03:38] !log kartik@deploy1003 kartik: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:04:01] !log kartik@deploy1003 kartik: Continuing with sync [15:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:05:14] (03Merged) 10jenkins-bot: Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [15:06:39] !log brennen@deploy1003 Started deploy [phabricator/deployment@6e0b4b1]: deploy phab2002 for T403494 [15:06:42] T403494: Deploy Phabricator/Phorge 2025-09-02 - https://phabricator.wikimedia.org/T403494 [15:07:22] !log brennen@deploy1003 Finished deploy [phabricator/deployment@6e0b4b1]: deploy phab2002 for T403494 (duration: 00m 43s) [15:07:41] !log brennen@deploy1003 Started deploy [phabricator/deployment@6e0b4b1]: deploy phab1004 for T403494 [15:08:24] !log brennen@deploy1003 Finished deploy [phabricator/deployment@6e0b4b1]: deploy phab1004 for T403494 (duration: 00m 43s) [15:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:45] (03PS1) 10Ladsgroup: Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) [15:09:31] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]] (duration: 12m 42s) [15:10:15] (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:10:20] finally. [15:11:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82404 and previous config saved to /var/cache/conftool/dbconfig/20250902-151108-fceratto.json [15:11:12] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:11:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2221.codfw.wmnet with reason: Maintenance [15:11:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82405 and previous config saved to /var/cache/conftool/dbconfig/20250902-151131-fceratto.json [15:11:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz) [15:11:55] jmm@cumin2002 reimage (PID 732606) is awaiting input [15:11:58] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]] [15:12:01] T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111 [15:13:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [15:13:22] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11139778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm [15:13:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82406 and previous config saved to /var/cache/conftool/dbconfig/20250902-151354-fceratto.json [15:16:12] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:16:55] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [15:18:34] (03CR) 10FNegri: openstack: add wmcs-server-id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi) [15:19:18] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1163 gradually with 4 steps - Maint over [15:20:22] (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup) [15:21:02] 10ops-codfw, 06SRE, 06DC-Ops: PSU issue on es2055 - https://phabricator.wikimedia.org/T403243#11139863 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated the cable and it's normalized. should be fine and not require any other hands on it. 
[15:22:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11139866 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:22:02] (03CR) 10FNegri: openstack: add wmcs-server-id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi) [15:22:24] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]] (duration: 10m 25s) [15:22:27] T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111 [15:26:39] 06SRE, 10SRE-Access-Requests: Update SSH key for Connie Chen - https://phabricator.wikimedia.org/T403242#11139889 (10cchen) Thank you @JMeybohm! [15:29:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82408 and previous config saved to /var/cache/conftool/dbconfig/20250902-152902-fceratto.json [15:29:36] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:55] (03CR) 10David Caro: [C:03+2] "Lets give this a try, I'll remove the other projects if they spam too much." 
[alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David Caro) [15:31:26] (03Merged) 10jenkins-bot: wmcs: add object storage quota alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David Caro) [15:33:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [15:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:50] (03PS1) 10Muehlenhoff: Also update partman recipe for new maps/eqiad nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184102 (https://phabricator.wikimedia.org/T381565) [15:38:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [15:39:58] (03CR) 10Muehlenhoff: [C:03+2] Also update partman recipe for new maps/eqiad nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184102 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:42:10] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp [15:44:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82410 and previous config saved to /var/cache/conftool/dbconfig/20250902-154409-fceratto.json [15:45:25] (03PS3) 10Cathal Mooney: WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) [15:50:54] (03PS1) 10Nik Gkountas: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) [15:51:15] (03PS1) 10Jdlrobson: Send email alerts to Reading Web Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 [15:52:37] (03CR) 10Ssingh: [C:03+1] Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [15:52:41] (03PS2) 10Jdlrobson: Send email alerts to Reading Web Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) [15:52:44] (03CR) 10Ssingh: [C:03+1] Remove ncredir3003 [puppet] - 10https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [15:53:21] (03PS1) 10DLynch: Edit check: set up the tone check a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) [15:55:52] (03PS2) 10Ebernhardson: cirrus: Stop using auto_expand_replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) [15:55:56] (03CR) 10KartikMistry: [C:03+1] ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [15:56:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm [15:56:32] (03CR) 10Muehlenhoff: [C:03+2] Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [15:56:33] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm completed: - 
maps2011 (**PASS**) - Downt... [15:59:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82412 and previous config saved to /var/cache/conftool/dbconfig/20250902-155918-fceratto.json [15:59:25] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:59:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2222.codfw.wmnet with reason: Maintenance [15:59:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82413 and previous config saved to /var/cache/conftool/dbconfig/20250902-155942-fceratto.json [16:00:05] jhathaway and moritzm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [16:00:52] (03PS1) 10Ayounsi: Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 [16:01:49] (03PS2) 10Ayounsi: Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 [16:01:54] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) [16:02:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82414 and previous config saved to /var/cache/conftool/dbconfig/20250902-160204-fceratto.json [16:02:23] (03CR) 10Sbisson: [C:03+1] ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [16:03:02] (03CR) 10CI reject: [V:04-1] Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:04:45] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1163 gradually with 4 steps - Maint over [16:05:46] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on testwiki in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183700 (https://phabricator.wikimedia.org/T401595) [16:05:46] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 
(https://phabricator.wikimedia.org/T401595) [16:06:52] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140060 (10Papaul) I took a look at the node, we do have a backplane issue see error below . The server is not coming up after a reboot. ` The System Configuration Check operation result... [16:09:57] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140073 (10RobH) a:03RobH So that means a bad backplane or mainboard (likely backplane). I'll steal this task and open a support ticket to have a tech dispatched with a replacement part. [16:09:59] !oncall-now [16:09:59] Oncall now for team SRE, rotation business_hours: [16:09:59] m.utante, u.random [16:13:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [16:13:48] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm [16:16:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 (owner: 10Ayounsi) [16:16:25] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82416 and previous config saved to /var/cache/conftool/dbconfig/20250902-161711-fceratto.json [16:19:02] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11140117 (10VRiley-WMF) I created an account at Juniper, tried to open a support case for it for me to get added, however I was unable to do that. Notified @RobH and he said he'd look into it. 
[16:22:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11140127 (10ssingh) Thanks for taking care of this @ayounsi! We will update this task when we are ready to remove the `eqiad` ones. [16:24:48] (03CR) 10Ayounsi: [C:03+2] Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 (owner: 10Ayounsi) [16:25:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:27:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:29:53] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2042.codfw.wmnet [16:29:56] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp2042.codfw.wmnet [16:32:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82417 and previous config saved to /var/cache/conftool/dbconfig/20250902-163219-fceratto.json [16:33:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [16:34:15] (03PS1) 10DLynch: Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231) [16:36:31] (03Abandoned) 10Ebernhardson: cirrus: Stop using auto_expand_replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 
(https://phabricator.wikimedia.org/T402627) (owner: 10Ebernhardson) [16:38:35] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11140228 (10xcollazo) CC @BTullis [16:38:36] (03PS2) 10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) [16:38:39] (03CR) 10CI reject: [V:04-1] Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [16:38:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [16:41:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82418 and previous config saved to /var/cache/conftool/dbconfig/20250902-164155-ladsgroup.json [16:41:59] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:43:03] (03PS1) 10Sbisson: CxServerClient: Log url instead of relative path upon failure [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) [16:44:03] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [16:44:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) (owner: 10Sbisson) [16:44:56] (03PS3) 
10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) [16:45:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11140268 (10Jhancock.wm) connected. using the serial connection for the ps1-b7-codfw temporarily. if we need a more permanent line, lmk and i can run it. [16:47:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82419 and previous config saved to /var/cache/conftool/dbconfig/20250902-164727-fceratto.json [16:47:31] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:49:43] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:51:33] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11140320 (10Jhancock.wm) hey! i got a thing mixed up but everything is good now. my bad. please let me know if you need anything else! [16:51:50] (03PS1) 10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) [16:56:49] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11140332 (10Jgreen) >>! 
In T400275#11140320, @Jhancock.wm wrote: > hey! i got a thing mixed up but everything is good now. my bad. please let me know if you need anything else! Co... [16:57:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82420 and previous config saved to /var/cache/conftool/dbconfig/20250902-165702-ladsgroup.json [16:57:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps2011.codfw.wmnet with OS bookworm [16:58:26] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm completed: - maps2011 (**PASS**) - Downt... [16:58:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm executed with errors: - maps2011 (**FAIL**... 
[16:58:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [16:58:52] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11140340 (10phaultfinder) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1700) [17:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:59] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11140361 (10phaultfinder) [17:04:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:07:34] (03PS1) 10Jasmine: switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874) [17:09:33] (03CR) 10VolkerE: Update vector search config with new wgVectorTypeahead (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [17:10:13] (03CR) 10Herron: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [17:12:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82421 and previous config saved to /var/cache/conftool/dbconfig/20250902-171210-ladsgroup.json [17:13:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005'] [17:13:40] (03PS1) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [17:13:50] (03CR) 10Herron: [C:03+2] profile::pyrra::filesystem::slo: add new slo define [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [17:14:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ganeti3005'] [17:14:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [17:14:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [17:14:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) 
(owner: 10Jdlrobson) [17:15:43] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005'] [17:16:00] (03CR) 10VolkerE: Remove deprecated search config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: 10Bernard Wang) [17:19:56] robh@cumin2002 upgrade-firmware (PID 799202) is awaiting input [17:20:07] (03CR) 10VolkerE: [C:04-1] Send email alerts to Reading Web Slack channel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson) [17:27:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82422 and previous config saved to /var/cache/conftool/dbconfig/20250902-172718-ladsgroup.json [17:27:22] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [17:27:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [17:27:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82423 and previous config saved to /var/cache/conftool/dbconfig/20250902-172741-ladsgroup.json [17:28:21] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3005'] [17:28:49] (03PS3) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) [17:29:16] (03PS4) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) [17:29:33] (03PS5) 10Jdlrobson: Send email alerts to Reading Web 
"Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) [17:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:29:39] (03CR) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson) [17:30:06] (03PS1) 10Bking: stat hosts: alert on I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1184128 (https://phabricator.wikimedia.org/T401589) [17:36:37] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11140566 (10Krinkle) [17:38:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: 10Ebernhardson) [17:39:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:40:36] 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140589 (10RobH) 
a:05RobH→03MoritzMuehlenhoff After updating the idrac, bios, and backplane firmware and resetting & then allowing the system to post a few times, it hasn't shown the... [17:40:55] (03PS3) 10Herron: pyrra: citoid enable revision param [puppet] - 10https://gerrit.wikimedia.org/r/1182898 (https://phabricator.wikimedia.org/T400073) [17:41:07] (03PS2) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595) [17:41:07] (03PS1) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510) [17:43:01] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510) [17:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:46:24] (03CR) 10Herron: [C:03+2] pyrra: citoid enable revision param [puppet] - 10https://gerrit.wikimedia.org/r/1182898 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [17:51:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:51:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [18:00:05] dancy and andre: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1800). 
[18:00:14] o/ [18:00:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:02:17] Pressing the button [18:02:33] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378) [18:02:35] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:03:27] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot) [18:04:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:06:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:00] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.17 refs T396378 [18:15:04] T396378: 1.45.0-wmf.17 deployment blockers 
- https://phabricator.wikimedia.org/T396378 [18:16:55] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11140708 (10Krinkle) [18:19:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:14] jouncebot: nowandnext [18:31:14] For the next 1 hour(s) and 28 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1800) [18:31:14] In 1 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000) [18:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [18:33:09] dancy: hiiiii, sorry to bother, when you're done with the deploy and nothing is needed and all okay (no rush, totally). Would you mind giving me a heads up so I quickly deploy something? [18:33:19] Amir1: All yours! 
[18:33:30] oh nice [18:33:39] party time [18:33:57] (03CR) 10Ladsgroup: [C:03+2] Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup) [18:34:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup) [18:34:53] (03Merged) 10jenkins-bot: Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup) [18:35:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]] [18:35:19] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [18:39:22] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:41:51] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [18:44:41] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:14] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]] (duration: 11m 57s) [18:47:17] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [18:49:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:54:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089 (owner: 10Mszwarc) [18:57:40] (03CR) 10Ladsgroup: [C:04-1] "if you search for es2049 in icinga.wikimedia.org, there is a massive disk space warning: https://icinga.wikimedia.org/cgi-bin/icinga/extin" [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [18:58:06] (03CR) 10Ladsgroup: [C:04-1] "since in es2026 disk usage is 28% and in es2049 it's 95%" [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [18:58:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11140824 (10Ottomata) Approved! 
[19:01:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:06] (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:06:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:09:13] (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:12:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:13:24] (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:19:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:51] (03PS11) 10Elukey: sre.hosts.provision: update 
cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [19:24:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:28:59] (03CR) 10Elukey: [C:04-1] sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:31:06] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:31:40] (03PS12) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [19:31:55] (03PS13) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [19:32:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:32:31] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:32:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:33:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:33:33] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:36:48] (03CR) 10Elukey: "Still see:" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:37:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:38:04] (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [19:39:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:12] (03PS1) 10Jforrester: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153 [19:47:15] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11141081 (10Jhancock.wm) i set it to the one i have that starts with a T. I can set it to something else if that one doesn't work for you, or you aren't sure which i'm talking about. 
[19:48:00] (03PS14) 10Elukey: WIP - sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [19:48:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:49:00] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:52:25] elukey@cumin1003 provision (PID 726973) is awaiting input [19:56:36] (03CR) 10Ottomata: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [19:58:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:50] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000) [20:00:05] danisztls, kemayo, stephanebisson, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] o/ I can self-deploy [20:00:17] o/ as can I [20:00:42] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11141178 (10ecarg) 05Open→03Resolved a:03ecarg Marking this as 'Resolved' because the inaugural board is s... [20:00:44] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11141181 (10Jgreen) >>! In T400275#11141081, @Jhancock.wm wrote: > i set it to the one i have that starts with a T. I can set it to something else if that one doesn't work for you,... [20:01:19] (03PS15) 10Elukey: WIP - sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [20:01:22] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:01:32] I will start deploying my patch to not keep everyone waiting [20:01:33] (03PS2) 10Jforrester: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153 (https://phabricator.wikimedia.org/T107188) [20:01:34] danisztls: you've just got one, so want to go first? 
[20:01:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:01:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:02:10] Kemayo: yep [20:02:43] (03Merged) 10jenkins-bot: Pre-deploy Newcomers survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza) [20:03:08] !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]] [20:03:11] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:06:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:07:40] (03PS16) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [20:08:31] (03CR) 10Elukey: "Worked on cp2043, I'll try on other nodes too. Let me know if the code is sound, I had to add another workaround for a weird use case in t" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [20:08:47] jouncebot now [20:08:48] For the next 0 hour(s) and 51 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000) [20:09:14] !log dani@deploy1003 dani: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:17] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:09:46] !log dani@deploy1003 dani: Continuing with sync [20:10:16] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546) [20:10:25] (03CR) 10CI reject: [V:04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:12:08] (03CR) 10Scott French: [C:03+1] switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874) (owner: 10Jasmine) [20:12:14] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:12:40] (03PS1) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 [20:12:40] (03PS1) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 [20:12:58] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546) [20:13:35] (03PS1) 10DDesouza: Fix typo on newcomers survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184160 (https://phabricator.wikimedia.org/T402915) [20:15:02] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]] (duration: 11m 53s) [20:15:05] T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915 [20:15:32] Kemayo: all yours [20:15:42] danisztls: thanks [20:16:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" 
[extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [20:16:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [20:17:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82426 and previous config saved to /var/cache/conftool/dbconfig/20250902-201705-ladsgroup.json [20:17:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [20:18:04] (03Merged) 10jenkins-bot: Edit check: set up the tone check a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch) [20:18:06] (03Merged) 10jenkins-bot: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch) [20:18:36] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]] [20:18:42] T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231 [20:18:43] T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195 [20:18:43] T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952 [20:19:06] (03PS1) 10Lucas 
Werkmeister: Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) [20:19:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [20:19:58] ^ if we have time, I’d love to get this deployed (cc MatmaRex) [20:20:14] 👍 [20:24:49] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:55] T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231 [20:24:55] T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195 [20:24:56] T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952 [20:25:37] !log kemayo@deploy1003 kemayo: Continuing with sync [20:25:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:27:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[20:27:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11141418 (10Jhancock.wm) Hi, checking is to see if I can remove the ops-codfw tag? I'm cleaning up our board. Are you using the tag to organize in some way... [20:28:22] (03CR) 10Bartosz Dziewoński: [C:03+1] Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [20:29:22] (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [20:30:27] (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis) [20:31:11] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]] (duration: 12m 34s) [20:31:17] T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231 [20:31:17] T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195 [20:31:17] T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952 [20:31:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11141430 (10wiki_willy) a:03Jclark-ctr [20:31:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) 
- 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) (owner: 10Jdlrobson) [20:32:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P82427 and previous config saved to /var/cache/conftool/dbconfig/20250902-203212-ladsgroup.json [20:34:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11141445 (10wiki_willy) a:03VRiley-WMF [20:35:31] (03PS2) 10JHathaway: acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) [20:35:53] (03CR) 10JHathaway: "good idea, added" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [20:36:14] (03CR) 10CI reject: [V:04-1] acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [20:37:23] (03PS3) 10JHathaway: acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) [20:38:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [20:45:15] (03Merged) 10jenkins-bot: Restore ext.visualEditor.track module [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) (owner: 10Jdlrobson) [20:45:39] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]] [20:45:44] T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127 [20:47:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to 
https://phabricator.wikimedia.org/P82428 and previous config saved to /var/cache/conftool/dbconfig/20250902-204722-ladsgroup.json [20:51:31] !log kemayo@deploy1003 jdlrobson, kemayo: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:51:34] T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127 [20:52:43] !log kemayo@deploy1003 jdlrobson, kemayo: Continuing with sync [20:57:54] jouncebot next [20:57:54] In 0 hour(s) and 2 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100) [20:57:59] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]] (duration: 12m 20s) [20:58:02] T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127 [20:58:25] Technically I have one more I could deploy, but if someone else wants to get something in then I don't mind. [20:58:36] Web doesn't exist any more, after all, so that window should be free. [20:58:46] I would love to get my config change deployed, it hopefully fixes a regression in several tools [20:58:59] Go for it. [20:59:20] Mine can be done a little later but I would love to squeeze it in the next hour [20:59:35] Kemayo: o/ I'm planning on using the web deployment window today but y'all can finish any backports first.
[20:59:51] ok I guess I’ll deploy “my” config change then [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100) [21:00:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [21:00:32] * lucaswerkmeister tries to put together an X-Wikimedia-Debug compatible test in the meantime [21:01:11] (03Merged) 10jenkins-bot: Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [21:01:36] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]] [21:01:42] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [21:01:42] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. 
- https://phabricator.wikimedia.org/T403519 [21:02:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82429 and previous config saved to /var/cache/conftool/dbconfig/20250902-210229-ladsgroup.json [21:02:33] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [21:02:35] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [21:02:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:03:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82430 and previous config saved to /var/cache/conftool/dbconfig/20250902-210259-ladsgroup.json [21:03:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11141597 (10phaultfinder) [21:06:20] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, lucaswerkmeister: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:24] testing [21:06:47] hm, nothing so far… let me try logging out and back in [21:06:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11141606 (10phaultfinder) [21:08:56] still nothing [21:09:05] but I’m not confident I’m sending the XWD header correctly [21:09:25] so I think I’ll go ahead with the deployment anyway [21:09:30] ok [21:09:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, lucaswerkmeister: Continuing with sync [21:09:44] :D [21:10:34] (I put `self.session.headers['X-Wikimedia-Debug'] = 'backend=k8s-mwdebug'` in the Runner’s __post_init__() fwiw) [21:10:58] (I don’t think I can easily get mwapi to show me the response headers that would indicate the actual server) [21:11:03] jan_drewniak: there are no deployments after ours so I guess we can go late if needed. You want to meet early? [21:13:08] lucaswerkmeister: there is also &servedby=1 which will add a field to the API response body [21:13:16] ooh [21:13:50] 'servedby': 'mw-debug.eqiad.pinkunicorn-7447bd958c-k6bw8' [21:13:51] hm [21:13:57] sounds like it doesn’t fix the issue then 😔 [21:14:10] (thanks anyway, I’ll try to remember that parameter ^^) [21:14:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:44] Jdlrobson: I have a portal banner to deploy after both Lucas_WMDE and lucaswerkmeister are done with their deployment :P we can meet once I get that started.
[21:14:57] should be done in a moment :P [21:15:00] no problem, we can reenable that once we figure out the current issue [21:15:03] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]] (duration: 13m 27s) [21:15:04] ^ what Lucas_WMDE said [21:15:08] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [21:15:08] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519 [21:15:10] * Lucas_WMDE done deploying [21:15:23] (since MatmaRex doesn’t need an immediate revert) [21:15:26] jan_drewniak: over to you [21:15:35] thank you! [21:15:44] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:15:46] we've been working on session handling code recently, one of the changes must have affected this somehow [21:16:11] (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [21:16:30] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:16:39] but it’s strange that it happens on wikidatawiki (group1) and not enwiki (group2) today, since they’re on the same train version [21:17:21] MatmaRex: plot twist, Mahir256 reports something started working again [21:17:25] * lucaswerkmeister tests my tools some more [21:18:14] ok “real” QuickCategories also works again [21:18:15] if this is related to sessions, it might help to "log out" your tools and log in again [21:18:18] no idea why my localhost test still has the issue [21:18:36] I 
tried to completely log out during the mwdebug phase (Special:UserLogout and revoke on Special:OAuthManageMyGrants) [21:18:51] (and discard the session in the tool to repeat the OAuth authorization) [21:19:58] I mean, I’m happy my tool is alive again, I guess :D [21:20:22] guess we’re somehow still using PHP session handling after all (perhaps only in OAuth?) [21:21:48] these are all oauth 1 apps? [21:21:57] wait, it’s obvious. my localhost test is against testwiki [21:22:00] i.e. group0 [21:22:03] that revert only fixed group1 :D [21:22:29] MatmaRex: as far as I’m aware at least, though I’m not sure about all of them (e.g. https://github.com/maxlath/wikibase-cli/issues/192 doesn’t say which oauth version or consumer) [21:23:59] lucaswerkmeister: in some unit tests we've seen cases where PHPSessionHandler magically transported values between different instances of WebRequest / mocked SessionManager. i wonder if in some cases this also happens in real code [21:24:28] jouncebot: nowandnext [21:24:28] For the next 0 hour(s) and 35 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100) [21:24:28] In 8 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T0600) [21:24:32] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 9.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:25:15] jan_drewniak: can you ping me when you’re done? I’d like to do another revert (but it’s a bit less urgent) [21:25:32] also stephanebisson ebernhardson, did you still want to deploy? sorry for butting in before you [21:26:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:27:43] No worries, I'll reschedule. [21:28:02] lucaswerkmeister: you want to revert it on group0 as well? 
[21:28:11] I’d do that, yeah [21:28:15] unless you’re against it? [21:28:21] but there’s plenty of non-test wikis in that group [21:28:34] no, i think reverting is okay [21:28:39] just leave us testwiki for testing :) [21:28:41] we could leave it on on testwiki if that helps [21:28:43] jinx ^^ [21:28:52] I’ll just need to retarget my test for test2wiki then ^^ [21:29:00] ok lemme put together the config change [21:29:33] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 11m 18s) [21:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:29:37] ah, a plain revert already leaves it disabled on testwiki, I don’t even need to change stuff [21:29:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:29:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:24] (03PS1) 10Lucas Werkmeister: Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324) [21:31:11] dangit, test2wiki is in group1, I’ll need to test on another group0 wiki [21:31:34] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 01m 59s) [21:32:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:33:08] lucaswerkmeister: i'll be off for tonight, i'll try to investigate this tomorrow (or maybe someone else will, i'll write in our team channel). to reproduce this, i should be able to use QuickCategories against test.wikipedia.org? i've never seen that tool before, but i can probably figure it out [21:33:19] yes [21:33:31] input could look like: [21:33:31] User:Lucas Werkmeister/sandbox|+Category:Testing T403519 [21:33:33] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519 [21:33:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:01] for testing, please don’t use “background” mode, as that would then break the background runner across the whole tool :) [21:34:02] (03PS3) 10Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 [21:34:14] so use the “Run these commands” button instead (that’s foreground mode) [21:36:29] lucaswerkmeister: mind if i copy-paste to the task? 
[21:36:35] sure [21:36:42] I was also thinking of leaving a comment to that effect later [21:36:45] I can also do it now while I wait ^^ [21:37:03] oh, yeah, please do. thank you [21:37:12] and sorry for breaking it D: [21:39:58] lucaswerkmeister: i just need to do one deploy in our window then can pass back to you [21:40:03] ok [21:40:33] MatmaRex: just before you leave – I’m on holiday from Thursday, so I won’t be able to run the maintenance script from T398177 for you (it wouldn’t finish in time) [21:40:34] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [21:40:43] just in case you were hoping to start running that tomorrow :) [21:40:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: 10Bernard Wang) [21:41:14] (you can still run it, you just need to find someone else who’ll still be around when it finishes ^^) [21:41:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:41:43] (03Merged) 10jenkins-bot: Remove deprecated search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: 10Bernard Wang) [21:42:08] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]] [21:42:12] T402208: Remove old search config - https://phabricator.wikimedia.org/T402208 [21:43:43] Lucas_WMDE: sure, no problem [21:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:48:04] !log jdlrobson@deploy1003 jdlrobson, bwang: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:48:08] T402208: Remove old search config - https://phabricator.wikimedia.org/T402208 [21:48:58] !log jdlrobson@deploy1003 jdlrobson, bwang: Continuing with sync [21:50:31] Lucas_WMDE: over to you [21:51:21] thanks! [21:51:35] hang on, spiderpig’s still running ^^ [21:51:40] I guess I’ll go as soon as it’s done [21:54:14] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]] (duration: 12m 06s) [21:54:18] T402208: Remove old search config - https://phabricator.wikimedia.org/T402208 [21:54:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [21:54:39] let’s go [21:55:29] (03Merged) 10jenkins-bot: Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister) [21:55:54] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]] [21:55:59] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [21:55:59] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. 
- https://phabricator.wikimedia.org/T403519 [21:58:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:00:16] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister, lucaswerkmeister-wmde: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:00:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:00:33] woooh https://test.wikidata.org/w/index.php?oldid=737809&diff=737810 [22:00:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister, lucaswerkmeister-wmde: Continuing with sync [22:01:16] (03PS1) 10Cwhite: airflow: disable icinga nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) [22:01:18] (03PS1) 10Cwhite: hiera: disable monitoring for legacy profile::airflow::instances [puppet] - 10https://gerrit.wikimedia.org/r/1184170 (https://phabricator.wikimedia.org/T384214) [22:01:20] (03PS1) 10Cwhite: airflow: remove nrpe definitions [puppet] - 10https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214) [22:01:43] (03CR) 10CI reject: [V:04-1] airflow: disable icinga nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) (owner: 10Cwhite) [22:03:01] (03PS2) 10Cwhite: airflow: disable icinga nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) [22:04:58] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas 
anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:05:57] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]] (duration: 10m 03s) [22:06:02] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [22:06:02] T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519 [22:06:09] * Lucas_WMDE done deploying [22:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:06:40] !log UTC late backport+config window (belatedly) done [22:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:45] I'm not saying I did just now get an "invalid response from server" on trying to save an edit like seconds ago when that deploy would've finished, but I maybe sorta did [22:06:49] it went away on reload and trying again tho [22:08:54] PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x1 at eqiad (db1216) taken on 2025-09-02 21:37:14 is 308 GiB, but the previous one was 386 GiB, a change of -20.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [22:09:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:24] hmmmm [22:11:27] that’s definitely not troubling at all [22:12:29] definitely not [22:14:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 
2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:17:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:28:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:30:46] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:31:18] perryprog: any further errors? [22:31:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:31:51] Nothing; I think you're safe [22:32:03] maybe ant got in my computer or something [22:32:13] phew, thanks ^^ [22:32:39] logspam-watch looks mostly okay fwiw, though there’s a spike of Stats: Label value cannot be empty. [22:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:33:17] though it looks like that also happened earlier today already, nevermind [22:33:53] (I guess that’s T403512) [22:33:53] T403512: PHP Warning: Stats: (RateLimiter_limit_actions_total) Stats: Label value cannot be empty. 
- https://phabricator.wikimedia.org/T403512 [22:34:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:24] * Lucas_WMDE afk, if you need me ping me elsewhere [22:44:41] FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:50:22] lucaswerkmeister, happening again. Console errors: https://phabricator.wikimedia.org/P82432 [22:50:39] I'm not a fan of how the errors have to do with session IDs... [22:53:34] oh no [22:55:08] * Lucas_WMDE looks at client errors logstash [22:56:19] nothing at all in logstash, so I guess this doesn’t get logged [22:57:01] we have a client logstash? Good chance my content blockers could hit it, though if there isn't anything it's surely not widespread [22:57:17] plus successful edit rate seems fine [22:57:35] but I think this has to be a bug somewhere in https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/82701978ceff97a0e24fcc73f0303afb4a1cfcb1/modules/ext.wikimediaEvents/editAttemptStep.js [22:57:43] (only codesearch result for session.editing_session_id) [22:57:54] totally different kind of session AFAICT [22:58:20] perryprog: yeah, it’s been around for a few years I think; but it’s possible it gets blocked (or respects do-not-track or something), I don’t know [22:58:41] neat! 
[22:58:59] https://wikitech.wikimedia.org/wiki/Client_errors [22:59:21] (03PS2) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180689 [23:00:06] (03Abandoned) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180689 (owner: 10Jdlrobson) [23:00:42] perryprog: hang on, on which wiki are you seeing these errors [23:00:46] enwiki [23:01:00] oh! That was group0! [23:01:02] then I’m fairly confident it’s not due to those config changes [23:01:07] they were group1 and then group0 yeah [23:01:23] okay word—I think I got group0 and group2 backwards [23:01:27] (03CR) 10Cwhite: [C:03+2] Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson) [23:01:28] it’s *possible* that the session handler has some spooky action at a distance, but I think it’s more likely that these errors are unrelated [23:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:09:06] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403561 (10phaultfinder) 03NEW [23:09:07] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403562 (10phaultfinder) 03NEW [23:10:49] FIRING: PuppetZeroResources: Puppet has failed 
generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:10:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:19:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82433 and previous config saved to /var/cache/conftool/dbconfig/20250902-232107-ladsgroup.json [23:21:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:28:24] (03CR) 10Jdlrobson: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson) [23:28:56] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:29:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 8.505 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:29:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:31:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:36:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82434 and previous config saved to /var/cache/conftool/dbconfig/20250902-233615-ladsgroup.json [23:38:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184178 [23:38:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184178 (owner: 10TrainBranchBot) [23:51:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82435 and previous config saved to /var/cache/conftool/dbconfig/20250902-235121-ladsgroup.json [23:52:29] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1184178 (owner: 10TrainBranchBot) [23:53:56] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources