[00:05:17] <wikibugs>	 (PS1) Chlod Alejandro: tlwiktionary: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433)
[00:08:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183751
[00:08:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183751 (owner: TrainBranchBot)
[00:15:01] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137589 (phaultfinder)
[00:30:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1183751 (owner: TrainBranchBot)
[00:31:46] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota Has worsened   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:36:46] <jinxer-wm>	 FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:38:58] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137605 (phaultfinder)
[00:41:34] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance
[00:43:53] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137606 (phaultfinder)
[00:44:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137607 (phaultfinder)
[00:51:46] <jinxer-wm>	 FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:56:46] <jinxer-wm>	 RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:00:53] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378)
[01:07:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378) (owner: TrainBranchBot)
[01:12:48] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 55s)
[01:22:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-aux_30443: Servers aux-k8s-worker1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:23:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:25:01] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.17 [core] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183752 (https://phabricator.wikimedia.org/T396378) (owner: TrainBranchBot)
[01:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:59:20] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2231.codfw.wmnet with reason: Maintenance
[01:59:28] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82357 and previous config saved to /var/cache/conftool/dbconfig/20250902-015927-ladsgroup.json
[01:59:31] <stashbot>	 T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0200)
[02:27:36] <wikibugs>	 (PS1) DDesouza: Pre-deploy Newcomers survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915)
[02:29:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: DDesouza)
[02:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:39:59] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137669 (phaultfinder)
[02:57:04] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82358 and previous config saved to /var/cache/conftool/dbconfig/20250902-025704-ladsgroup.json
[02:57:07] <stashbot>	 T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362
[02:59:58] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11137676 (phaultfinder)
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0300)
[03:02:01] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378)
[03:02:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378) (owner: TrainBranchBot)
[03:02:53] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.45.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1183755 (https://phabricator.wikimedia.org/T396378) (owner: TrainBranchBot)
[03:03:17] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.17  refs T396378
[03:03:20] <stashbot>	 T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378
[03:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:12:12] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P82359 and previous config saved to /var/cache/conftool/dbconfig/20250902-031211-ladsgroup.json
[03:27:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231', diff saved to https://phabricator.wikimedia.org/P82360 and previous config saved to /var/cache/conftool/dbconfig/20250902-032719-ladsgroup.json
[03:42:27] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2231 (T403362)', diff saved to https://phabricator.wikimedia.org/P82361 and previous config saved to /var/cache/conftool/dbconfig/20250902-034226-ladsgroup.json
[03:42:30] <stashbot>	 T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362
[03:47:08] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.17  refs T396378 (duration: 43m 50s)
[03:47:11] <stashbot>	 T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0400)
[04:01:14] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.14 (duration: 01m 04s)
[04:03:54] <wikibugs>	 (CR) Anzx: [C:+1] tlwiktionary: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: Chlod Alejandro)
[04:10:47] <wikibugs>	 (CR) Anzx: [C:+1] "Please add comment_1_5x and comment_2x with task ID, It's good to associate the logo change with the specific task it relates to." [mediawiki-config] - https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: Chlod Alejandro)
[04:11:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:16:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:25:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:25:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:26:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:39:00] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[04:43:54] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137713 (phaultfinder)
[04:48:52] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137716 (phaultfinder)
[05:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:47] <wikibugs>	 (PS1) KartikMistry: cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982)
[05:08:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:22] <icinga-wm>	 RECOVERY - mysqld processes on es2026 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:22:24] <icinga-wm>	 RECOVERY - MariaDB read only es2 on es2026 is OK: Version 10.11.13-MariaDB-log, Uptime 66s, read_only: True, event_scheduler: True, 11.30 QPS, connection latency: 0.021679s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:25:05] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2026 gradually with 4 steps - Pool es2026.codfw.wmnet in after cloning
[05:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:31:12] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:33:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:59:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0600).
[06:04:21] <wikibugs>	 (CR) Arnaudb: [C:+2] "sampling traffic to confirm analysis in T402611#11131164" [puppet] - https://gerrit.wikimedia.org/r/1183698 (owner: Arnaudb)
[06:04:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:08:48] <wikibugs>	 (PS1) Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1183766
[06:10:34] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2026 gradually with 4 steps - Pool es2026.codfw.wmnet in after cloning
[06:10:35] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2026.codfw.wmnet onto es2049.codfw.wmnet
[06:23:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:28:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:31:10] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:32:45] <jinxer-wm>	 FIRING: [2x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:34:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1183707 (https://phabricator.wikimedia.org/T403154) (owner: Filippo Giunchedi)
[06:37:45] <jinxer-wm>	 FIRING: [3x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:41:47] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137799 (MoritzMuehlenhoff) All five replicas on maps-test have been re-synched and the Postgres log files look good now.
[06:42:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:46:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for esams03 - jmm@cumin2002"
[06:46:59] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137802 (MoritzMuehlenhoff)
[06:47:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for esams03 - jmm@cumin2002"
[06:47:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:49:35] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Assign ganeti_routed role to ganeti3006 and configure cluster in esams [puppet] - https://gerrit.wikimedia.org/r/1183704 (owner: Muehlenhoff)
[06:49:43] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137808 (ayounsi)
[06:50:35] <wikibugs>	 (CR) Elukey: [C:+2] profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[06:50:45] <wikibugs>	 (CR) Elukey: [C:+2] Delete profile::python38 [puppet] - https://gerrit.wikimedia.org/r/1183680 (owner: Elukey)
[06:51:01] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[06:51:10] <wikibugs>	 (PS6) Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948)
[06:52:45] <jinxer-wm>	 FIRING: [3x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:52:46] <wikibugs>	 (CR) Elukey: [C:+2] Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[06:53:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:54:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:57:45] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0700).
[07:00:05] <jouncebot>	 hueitan and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:19] <hueitan>	 o/
[07:00:22] <kart_>	 here
[07:00:34] <kart_>	 I'll start with hueitan's change..
[07:00:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: Huei Tan)
[07:03:17] <wikibugs>	 (Merged) jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: Huei Tan)
[07:03:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[07:04:00] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]]
[07:04:03] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:04:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[07:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:05:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: Nik Gkountas)
[07:10:28] <logmsgbot>	 !log kartik@deploy1003 hueitan, kartik: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:10:31] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:11:05] <kart_>	 hueitan: you can test the patch on wmf.16 Wikis.
[07:13:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet
[07:13:28] <logmsgbot>	 !log kartik@deploy1003 hueitan, kartik: Continuing with sync
[07:14:17] <wikibugs>	 (PS1) Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496)
[07:14:40] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: Huei Tan)
[07:19:34] <moritzm>	 !log create ganeti03 cluster T402259
[07:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:37] <stashbot>	 T402259: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259
[07:20:44] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183692|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 16m 43s)
[07:20:47] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:21:26] <kart_>	 hueitan: I'll deploy the second patch once CI has passed.
[07:25:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: Huei Tan)
[07:25:53] <wikibugs>	 (CR) Arnaudb: [C:+2] Revert^3 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1183766 (owner: Arnaudb)
[07:26:40] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137841 (MoritzMuehlenhoff)
[07:27:07] <wikibugs>	 (Merged) jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.17) - https://gerrit.wikimedia.org/r/1183973 (https://phabricator.wikimedia.org/T402496) (owner: Huei Tan)
[07:27:36] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]]
[07:27:39] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:33:02] <Mvolz>	 jouncebot: nowandnext
[07:33:02] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T0700)
[07:33:03] <jouncebot>	 In 2 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1000)
[07:33:30] <logmsgbot>	 !log kartik@deploy1003 hueitan, kartik: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:33:34] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:34:09] <wikibugs>	 (CR) Slyngshede: [C:+2] P:puppetserver::volatile enable datacenter timer [puppet] - https://gerrit.wikimedia.org/r/1183612 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:36:16] <logmsgbot>	 !log kartik@deploy1003 hueitan, kartik: Continuing with sync
[07:37:14] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137855 (TheDJ) There should probably be an alert for `could not receive data from WAL stream`.. there's at least 3 old closed tickets  in phab with exactly the same log line and s...
[07:39:13] <wikibugs>	 (PS1) Muehlenhoff: Add esams03 to Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259)
[07:39:58] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1183979
[07:41:14] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[07:41:35] <Mvolz>	 Could I slot in for a deploy after this one? 
[07:41:48] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183973|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 14m 12s)
[07:41:52] <stashbot>	 T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496
[07:42:10] <hueitan>	 Mvolz: sure, we're almost done.
[07:42:20] <hueitan>	 kart_ will let us know
[07:42:23] <Mvolz>	 cool
[07:42:36] <wikibugs>	 (PS2) Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1183979
[07:42:48] <wikibugs>	 (PS3) Kosta Harlan: hCaptcha: Set log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1183979
[07:43:03] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11137859 (MoritzMuehlenhoff) >>! In T381565#11137855, @TheDJ wrote: > There should probably be an alert for `could not receive data from WAL stream`.. there's at least 3 old closed...
[07:43:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:43:41] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add esams03 to Netbox sync [puppet] - https://gerrit.wikimedia.org/r/1183978 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[07:47:17] <wikibugs>	 (CR) Vgutierrez: [C:-1] "current approach will alter our metrics, please track wdqs as `ua_policy:wdqs` (same as in Varnish)" [puppet] - https://gerrit.wikimedia.org/r/1183679 (owner: Slyngshede)
[07:48:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:28] <Mvolz>	 kart_: looks like scap is done? Would it be okay for me to start my config change? 
[07:53:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:54:44] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[07:58:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:59:31] <kart_>	 Mvolz: sorry, missed msg
[07:59:36] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:59:38] <kart_>	 Mvolz: Please go ahead.
[08:03:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:04:47] <wikibugs>	 (03PS2) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[08:04:50] <wikibugs>	 (03CR) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[08:05:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259)
[08:12:21] <Mvolz>	 eh, window's over, I'll do it some other time.
[08:17:02] <wikibugs>	 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11137889 (10JMeybohm) >> Also these IPs/hosts might and will change in the future so they would have to be updates regularly. >  > How often might that happen? I can...
[08:17:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] java: add support for Trixie / Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1183707 (https://phabricator.wikimedia.org/T403154) (owner: 10Filippo Giunchedi)
[08:17:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:24:26] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:10] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1183975 (https://phabricator.wikimedia.org/T402611)
[08:25:25] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:25:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:26:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:27:35] <wikibugs>	 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11137915 (10MoritzMuehlenhoff) >>! In T403298#11137889, @JMeybohm wrote: >> How often might that happen? > I can't say for sure. Definitely for every Debian OS versi...
[08:40:05] <wikibugs>	 (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1184032
[08:40:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-serve1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:42:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:43:18] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1184032 (owner: 10Arnaudb)
[08:43:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Line-wrap Homer diffs" [puppet] - 10https://gerrit.wikimedia.org/r/1184033
[08:47:01] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1083.eqiad.wmnet with OS bullseye
[08:47:31] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet...
[08:47:53] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1084.eqiad.wmnet with OS bullseye
[08:48:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137956 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet...
[08:48:24] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[08:48:57] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137957 (10phaultfinder)
[08:50:47] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1085.eqiad.wmnet with OS bullseye
[08:51:01] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet...
[08:53:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add replacement insetup VMS for VMs currently running on esams01 [puppet] - 10https://gerrit.wikimedia.org/r/1184031 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[08:53:52] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137960 (10phaultfinder)
[08:54:56] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[08:57:48] <wikibugs>	 (03PS1) 10Elukey: apt: add the non-free-firmware component for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1184035
[09:00:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:01:27] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff)
[09:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:17] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11137967 (10ayounsi)
[09:03:27] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "indeed, I missed that the issue is with rancid and not homer" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff)
[09:05:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:06:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet
[09:06:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:07:06] <wikibugs>	 (03CR) 10Vgutierrez: P:cache::haproxy disallow Wikidata Query Service as UA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[09:07:19] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161)
[09:08:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:08:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1184035 (owner: 10Elukey)
[09:08:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "Line-wrap Homer diffs" [puppet] - 10https://gerrit.wikimedia.org/r/1184033 (owner: 10Muehlenhoff)
[09:10:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:10:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:10:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:10:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[09:10:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[09:10:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:10:56] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage
[09:11:33] <wikibugs>	 (03CR) 10Elukey: [C:03+2] apt: add the non-free-firmware component for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1184035 (owner: 10Elukey)
[09:13:17] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage
[09:14:17] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1084.eqiad.wmnet with reason: host reimage
[09:14:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:14:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:14:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:14:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[09:14:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[09:14:58] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet
[09:15:18] <wikibugs>	 (03PS2) 10Chlod Alejandro: tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433)
[09:15:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet
[09:15:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:16:00] <wikibugs>	 (03CR) 10Chlod Alejandro: "Done! Didn't know the `comment` doesn't automatically add the comments for those two in. Perhaps another change can be made for that in `m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro)
[09:17:56] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1085.eqiad.wmnet with reason: host reimage
[09:19:05] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11138061 (10elukey) I tested the following and I see the correct image:  ` curl -s "https://kartotherian.svc.codfw.wmnet:6543/img/osm-intl,14,a,a,300x200.png?lang=en&domain=en.wikiped...
[09:19:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:19:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:19:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:19:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[09:19:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[09:19:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:21:20] <wikibugs>	 (03PS1) 10Tiziano Fogli: MysqlSustainedReplLag: replace Icinga-based PromQL checks [alerts] - 10https://gerrit.wikimedia.org/r/1184039 (https://phabricator.wikimedia.org/T315866)
[09:23:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:23:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[09:23:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:23:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[09:23:26] <wikibugs>	 (03CR) 10Anzx: [C:03+1] tlwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183750 (https://phabricator.wikimedia.org/T403433) (owner: 10Chlod Alejandro)
[09:23:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[09:23:31] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet
[09:23:44] <logmsgbot>	 !log mvernon@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1083.eqiad.wmnet with OS bullseye
[09:23:47] <icinga-wm>	 PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100%
[09:24:00] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet wit...
[09:24:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Use the standby analytics_meta mariadb server temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1183642 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis)
[09:24:28] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1083.eqiad.wmnet with OS bullseye
[09:24:47] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet...
[09:25:15] <icinga-wm>	 RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[09:25:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on ml-serve1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:26:26] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[09:27:00] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[09:27:05] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[09:27:56] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[09:28:37] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[09:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:29:47] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[09:29:55] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[09:31:17] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1084.eqiad.wmnet with OS bullseye
[09:31:38] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet wit...
[09:31:51] <wikibugs>	 (03PS2) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161)
[09:32:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:33:11] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:33:42] <wikibugs>	 (03PS3) 10Slyngshede: P:cache::haproxy copy datacenter.mmdb [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161)
[09:33:54] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1085.eqiad.wmnet with OS bullseye
[09:34:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet wit...
[09:37:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: openstack: add wmcs-server-id [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407)
[09:38:36] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[09:40:22] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6826/co" [puppet] - 10https://gerrit.wikimedia.org/r/1184037 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:40:40] <jinxer-wm>	 RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:42:13] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11138109 (10elukey) @TheDJ Hi! As FYI I just repooled maps codfw, we don't see anymore issues but please let us know if you see anything weird. Thanks!
[09:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:58:18] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423)
[09:58:21] <wikibugs>	 (03PS1) 10Btullis: Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478)
[09:59:40] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1000)
[10:02:57] <wikibugs>	 06SRE: offboard-user: Check for use of email address of user to be offboarded across Puppet repo - https://phabricator.wikimedia.org/T403452 (10Aklapper) 03NEW
[10:04:43] <icinga-wm>	 PROBLEM - MariaDB Replica IO: analytics_meta on db1208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event analytics-meta-bin.000197 at 258619100, the last event read from analytics-meta-bin.000271 at 667686198, the last byte read from analytics-meta-bin.000271 at 667686229. https://wikitech.wikimedia.or
[10:04:43] <icinga-wm>	 ariaDB/troubleshooting%23Depooling_a_replica
[10:06:23] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Use the standby analytics_meta mariadb server temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1183642 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis)
[10:10:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "Thanks for spotting this! can you also add the change to the experimental namespace?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: 10Gkyziridis)
[10:11:22] <wikibugs>	 (03PS1) 10Aklapper: offboard-user: Remove "Security" from privileged Phabricator projects [puppet] - 10https://gerrit.wikimedia.org/r/1184044
[10:11:56] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis)
[10:12:39] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis)
[10:13:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11138223 (10VRiley-WMF) |Device A|Device A Port|Device B|Device B Port|Type|Notes|Length required| |----------|-----------------|----------|----------|-------|-----|-------------...
[10:13:58] <wikibugs>	 (03Merged) 10jenkins-bot: Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis)
[10:14:48] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[10:15:18] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage
[10:15:38] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400119)
[10:16:42] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270)
[10:16:51] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[10:17:23] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[10:17:31] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez)
[10:19:12] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[10:19:12] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1083.eqiad.wmnet with reason: host reimage
[10:19:36] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[10:21:23] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:21:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82368 and previous config saved to /var/cache/conftool/dbconfig/20250902-102130-fceratto.json
[10:21:36] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[10:21:38] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[10:21:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[10:22:05] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[10:23:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82369 and previous config saved to /var/cache/conftool/dbconfig/20250902-102353-fceratto.json
[10:24:55] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[10:25:06] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[10:28:05] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert)
[10:28:23] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[10:29:18] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[10:30:09] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert)
[10:30:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet
[10:31:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet
[10:31:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[10:34:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:34] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] hiera: Enable JA3N fingerprinting CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez)
[10:34:35] <wikibugs>	 (03PS2) 10Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423)
[10:34:43] <icinga-wm>	 RECOVERY - MariaDB Replica IO: analytics_meta on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:35:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:35:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:35:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:35:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[10:35:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[10:35:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:35:52] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1083.eqiad.wmnet with OS bullseye
[10:36:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:36:08] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138352 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet wit...
[10:36:13] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:36:34] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:36:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:37:06] <wikibugs>	 (CR) FNegri: openstack: add wmcs-server-id (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: Filippo Giunchedi)
[10:37:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:37:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:38:29] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[10:38:48] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[10:39:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82370 and previous config saved to /var/cache/conftool/dbconfig/20250902-103901-fceratto.json
[10:40:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:40:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:40:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:40:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[10:40:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[10:40:22] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3005.esams.wmnet
[10:40:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3005.esams.wmnet
[10:40:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:43:33] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable JA3N fingerprinting CDN wide [puppet] - https://gerrit.wikimedia.org/r/1184046 (https://phabricator.wikimedia.org/T400270) (owner: Vgutierrez)
[10:43:42] <wikibugs>	 (CR) Ilias Sarantopoulos: "One comment on the maxreplicas for experimental ns, other than that looks good!" [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: Gkyziridis)
[10:44:26] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!  I think the dict is right based on what's on the boxes." [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:44:26] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:44:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:45:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:45:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:45:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3005.esams.wmnet on all recursors
[10:45:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3005.esams.wmnet on all recursors
[10:45:11] <wikibugs>	 (CR) Cathal Mooney: "Actually I notice one nit in-line" [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:45:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:45:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3005.esams.wmnet - jmm@cumin2002"
[10:46:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3005.esams.wmnet with OS bookworm
[10:46:39] <wikibugs>	 (PS3) Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[10:46:45] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138377 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ncredir3005.esams.wmnet with OS bookworm
[10:47:40] <wikibugs>	 SRE, Traffic, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11138378 (gmodena) >>! In T402959#11132802, @CDanis wrote: > Hi @Lydia_Pintscher , SRE can make some exceptio...
[10:47:48] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6828/co" [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[10:49:18] <wikibugs>	 (CR) Cathal Mooney: Nokia: /routing-policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:49:36] <wikibugs>	 (CR) Vgutierrez: P:cache::haproxy disallow Wikidata Query Service as UA (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[10:49:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:51:03] <wikibugs>	 (PS4) Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[10:51:47] <wikibugs>	 (CR) MVernon: [C:+2] swift: re-add 3 nodes, drain the next 3 [puppet] - https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) (owner: MVernon)
[10:51:52] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6829/co" [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[10:52:02] <wikibugs>	 (PS3) MVernon: swift: re-add 3 nodes, drain the next 3 [puppet] - https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877)
[10:53:26] <wikibugs>	 (PS3) Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423)
[10:53:28] <wikibugs>	 (CR) MVernon: [C:+2] swift: re-add 3 nodes, drain the next 3 [puppet] - https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) (owner: MVernon)
[10:54:01] <wikibugs>	 (PS5) Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[10:54:03] <wikibugs>	 (CR) Cathal Mooney: Nokia: /routing-policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:54:03] <wikibugs>	 (CR) Gkyziridis: ml-services: Fix KServe batcher setup for edit-check. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: Gkyziridis)
[10:54:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P82371 and previous config saved to /var/cache/conftool/dbconfig/20250902-105411-fceratto.json
[10:54:50] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6830/co" [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[10:58:50] <wikibugs>	 (PS6) Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[10:58:53] <wikibugs>	 (CR) Ayounsi: Nokia: /routing-policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1183108 (owner: Ayounsi)
[10:59:22] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11138396 (MatthewVernon)
[10:59:47] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6831/co" [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[11:00:57] <wikibugs>	 (PS7) Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119)
[11:02:19] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6832/co" [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[11:02:25] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Yep no problem with this +1." [homer/public] - https://gerrit.wikimedia.org/r/1183099 (owner: Ayounsi)
[11:04:22] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:08:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3005.esams.wmnet with reason: host reimage
[11:09:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T401906)', diff saved to https://phabricator.wikimedia.org/P82372 and previous config saved to /var/cache/conftool/dbconfig/20250902-110919-fceratto.json
[11:09:23] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:09:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:09:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82373 and previous config saved to /var/cache/conftool/dbconfig/20250902-110942-fceratto.json
[11:11:24] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looking at the current stats of maps-test2002 we're having 50ish connections. But raising this surely can't hurt either, the new maps node" [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[11:12:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T401906)', diff saved to https://phabricator.wikimedia.org/P82374 and previous config saved to /var/cache/conftool/dbconfig/20250902-111203-fceratto.json
[11:12:11] <wikibugs>	 (PS2) Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577)
[11:13:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3005.esams.wmnet with reason: host reimage
[11:15:33] <wikibugs>	 (CR) Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[11:23:40] <wikibugs>	 (CR) Slyngshede: [V:+1] P:cache::haproxy disallow Wikidata Query Service as UA (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[11:24:56] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] "Great, let's go!" [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: Gkyziridis)
[11:25:25] <wikibugs>	 (CR) Ayounsi: JunOS IBGP: adjust template to work with updated data from plugin (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[11:26:15] <wikibugs>	 (CR) Vgutierrez: [C:+1] P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[11:27:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82375 and previous config saved to /var/cache/conftool/dbconfig/20250902-112711-fceratto.json
[11:28:07] <wikibugs>	 (CR) Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[11:29:07] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: Gkyziridis)
[11:29:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3005.esams.wmnet with OS bookworm
[11:29:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3005.esams.wmnet
[11:30:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138474 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ncredir3005.esams.wmnet with OS bookworm completed: - ncredi...
[11:31:20] <wikibugs>	 (Merged) jenkins-bot: ml-services: Fix KServe batcher setup for edit-check. [deployment-charts] - https://gerrit.wikimedia.org/r/1184042 (https://phabricator.wikimedia.org/T403423) (owner: Gkyziridis)
[11:32:46] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:34:01] <wikibugs>	 (CR) Ayounsi: [C:+2] "We could still make it configurable when needed. I just worry that too many nested dicts makes it too complex in the longer run." [homer/public] - https://gerrit.wikimedia.org/r/1183099 (owner: Ayounsi)
[11:35:22] <wikibugs>	 (Merged) jenkins-bot: Nokia OSPF: different proposal [homer/public] - https://gerrit.wikimedia.org/r/1183099 (owner: Ayounsi)
[11:37:19] <Amir1>	 jouncebot: nowandnext
[11:37:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 22 minute(s)
[11:37:19] <jouncebot>	 In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1200)
[11:37:40] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - https://gerrit.wikimedia.org/r/1183679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[11:38:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3005.esams.wmnet
[11:38:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:39:56] <kart_>	 Deploying cxserver. Staging only change.
[11:40:20] <Amir1>	 Emperor: I'm about to do the switchover of s1, setting all of English Wikipedia to read-only, and the best part: the script broke last time I did it, so it might take longer as I might have to do a lot of stuff manually while the whole site is RO.
[11:40:27] <wikibugs>	 (CR) KartikMistry: [C:+2] cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[11:42:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3005.esams.wmnet - jmm@cumin2002"
[11:42:05] <wikibugs>	 (Merged) jenkins-bot: cxserver: staging: Update to 2025-09-02-045916-production [deployment-charts] - https://gerrit.wikimedia.org/r/1183761 (https://phabricator.wikimedia.org/T394982) (owner: KartikMistry)
[11:42:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P82376 and previous config saved to /var/cache/conftool/dbconfig/20250902-114219-fceratto.json
[11:43:01] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:43:24] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:43:40] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T402870
[11:43:43] <stashbot>	 T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870
[11:44:09] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1184 with weight 0 T402870', diff saved to https://phabricator.wikimedia.org/P82377 and previous config saved to /var/cache/conftool/dbconfig/20250902-114408-ladsgroup.json
[11:44:43] <wikibugs>	 (PS1) Muehlenhoff: Add ncredir3005 as ncredir node [puppet] - https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259)
[11:44:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3005.esams.wmnet - jmm@cumin2002"
[11:44:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:44:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3005.esams.wmnet on all recursors
[11:44:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3005.esams.wmnet on all recursors
[11:45:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3005.esams.wmnet - jmm@cumin2002"
[11:45:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3005.esams.wmnet - jmm@cumin2002"
[11:48:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3005.esams.wmnet with OS bookworm
[11:49:04] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138559 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host durum3005.esams.wmnet with OS bookworm
[11:52:46] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:54:15] <wikibugs>	 (CR) Vgutierrez: [C:+1] Add ncredir3005 as ncredir node [puppet] - https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[11:57:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T401906)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250902-115727-fceratto.json
[11:57:47] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:57:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance
[11:57:48] <wikibugs>	 (PS3) Anzx: idwiki: Add extended confirmed usergroup & restriction level [mediawiki-config] - https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755)
[11:57:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82380 and previous config saved to /var/cache/conftool/dbconfig/20250902-115754-fceratto.json
[11:58:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: Anzx)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1200)
[12:00:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82381 and previous config saved to /var/cache/conftool/dbconfig/20250902-120020-fceratto.json
[12:00:32] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:00:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ncredir3005 as ncredir node [puppet] - https://gerrit.wikimedia.org/r/1184053 (https://phabricator.wikimedia.org/T402259) (owner: Muehlenhoff)
[12:01:35] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:01:48] <wikibugs>	 (CR) Elukey: [V:+1] "At the moment codfw takes a lot less traffic than eqiad, they are very imbalanced. My main concern is for when a single cluster needs to a" [puppet] - https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[12:01:54] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:02:34] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:03:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138603 (elukey) Resolved→Open @Jclark-ctr Hi! I noticed that console redir seems not working for ml-serve1013 (but it works for 1012), and the bios settings...
[12:04:03] <Emperor>	 Amir1: ack, good luck...
[12:06:54] <wikibugs>	 SRE-SLO, Citoid, VisualEditor, Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11138642 (elukey) @Mvolz Hi! Sorry for the delay!  Prometheus metrics cannot be filtered...
[12:08:30] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[12:08:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3005.esams.wmnet with reason: host reimage
[12:10:57] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:11:41] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:13:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3005.esams.wmnet with reason: host reimage
[12:13:55] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870)
[12:13:59] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870) (owner: Gerrit maintenance bot)
[12:15:29] <Amir1>	 !log Starting s1 eqiad failover from db1163 to db1184 - T402870
[12:15:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82382 and previous config saved to /var/cache/conftool/dbconfig/20250902-121531-fceratto.json
[12:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:32] <stashbot>	 T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870
[12:15:49] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T402870', diff saved to https://phabricator.wikimedia.org/P82383 and previous config saved to /var/cache/conftool/dbconfig/20250902-121548-ladsgroup.json
[12:15:52] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:16:39] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[12:18:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T402870', diff saved to https://phabricator.wikimedia.org/P82384 and previous config saved to /var/cache/conftool/dbconfig/20250902-121814-ladsgroup.json
[12:20:15] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870)
[12:20:16] <wikibugs>	 (CR) Ladsgroup: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: Gerrit maintenance bot)
[12:20:18] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: Gerrit maintenance bot)
[12:20:32] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[12:21:33] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[12:21:49] <wikibugs>	 (PS1) Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471)
[12:22:29] <wikibugs>	 (CR) Dreamy Jazz: [C:-1] "We need to decide on which wikis we will create these tables on and then create them on production before we merge this." [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: Dreamy Jazz)
[12:23:11] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1163 T402870', diff saved to https://phabricator.wikimedia.org/P82385 and previous config saved to /var/cache/conftool/dbconfig/20250902-122310-ladsgroup.json
[12:23:18] <stashbot>	 T402870: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T402870
[12:24:21] <wikibugs>	 (CR) CI reject: [V:-1] tables-catalog: Document new CheckUser database tables [puppet] - https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: Dreamy Jazz)
[12:24:52] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir3005.esams.wmnet
[12:25:10] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir3005.esams.wmnet
[12:25:16] <wikibugs>	 (PS1) Stevemunene: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301)
[12:25:17] <wikibugs>	 (PS1) Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301)
[12:25:20] <wikibugs>	 (PS1) Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301)
[12:25:25] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:25:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:26:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:27:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ncredir3003 [puppet] - 10https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259)
[12:28:08] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1163.eqiad.wmnet with reason: Old primary of s1
[12:28:22] <wikibugs>	 (03PS2) 10Dreamy Jazz: tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471)
[12:28:34] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:04-1] tables-catalog: Document new CheckUser database tables [puppet] - 10https://gerrit.wikimedia.org/r/1184058 (https://phabricator.wikimedia.org/T403471) (owner: 10Dreamy Jazz)
[12:29:10] <Amir1>	 Emperor: I'll be running an upgrade cookbook that sometimes triggers a page on db1163. I'm so sorry for this mess but if you get a page for db1163, please ignore
[12:29:19] <Amir1>	 (it removes the downtime when it shouldn't)
[12:29:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3005.esams.wmnet with OS bookworm
[12:29:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3005.esams.wmnet
[12:30:09] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host durum3005.esams.wmnet with OS bookworm completed: - durum300...
[12:30:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P82387 and previous config saved to /var/cache/conftool/dbconfig/20250902-123038-fceratto.json
[12:31:15] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-mariadb1001.eqiad.wmnet
[12:31:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db1163.eqiad.wmnet
[12:31:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1163 - Upgrading db1163.eqiad.wmnet
[12:31:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1163 - Upgrading db1163.eqiad.wmnet
[12:32:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[12:32:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[12:35:38] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1001.eqiad.wmnet
[12:38:03] <wikibugs>	 (03PS1) 10Elukey: admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423)
[12:38:50] <wikibugs>	 (03PS1) 10Brouberol: airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529)
[12:39:58] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol)
[12:40:25] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-restart-ats rolling restart_daemons on A:cp
[12:40:39] <wikibugs>	 (03PS2) 10Brouberol: airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529)
[12:42:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: increase DAG file parsing interval [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184064 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol)
[12:43:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1163.eqiad.wmnet
[12:44:38] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1001.eqiad.wmnet
[12:44:39] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-mariadb1001.eqiad.wmnet
[12:44:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey)
[12:44:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403431#11138809 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[12:45:31] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:45:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T401906)', diff saved to https://phabricator.wikimedia.org/P82388 and previous config saved to /var/cache/conftool/dbconfig/20250902-124545-fceratto.json
[12:45:49] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[12:46:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:46:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance
[12:46:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82389 and previous config saved to /var/cache/conftool/dbconfig/20250902-124608-fceratto.json
[12:47:16] <wikibugs>	 (03PS2) 10Elukey: admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423)
[12:48:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82390 and previous config saved to /var/cache/conftool/dbconfig/20250902-124830-fceratto.json
[12:51:05] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1163.eqiad.wmnet with reason: Old primary of s1
[12:53:55] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11138841 (10phaultfinder)
[12:56:38] <wikibugs>	 (03PS2) 10Stevemunene: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301)
[12:56:38] <wikibugs>	 (03PS2) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301)
[12:56:38] <wikibugs>	 (03PS2) 10Stevemunene: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301)
[12:57:41] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey)
[12:58:36] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir3003.esams.wmnet
[12:58:55] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11138862 (10phaultfinder)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1300).
[13:00:05] <jouncebot>	 Tran, JustHannah, kart_, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Tran>	 o/
[13:00:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3005.wikimedia.org
[13:00:12] <anzx>	 o/
[13:00:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:00:20] <kart_>	 here
[13:00:44] <Lucas_WMDE>	 I’m here but don’t really have time for deploying right now :/
[13:00:51] <kart_>	 I can self deploy Nik's patch.
[13:00:54] <Tran>	 I can deploy my own
[13:01:02] <kart_>	 Go ahead Tran
[13:01:13] <Tran>	 Thanks. Going to start - two going in at the same time as one is a no-op comment update.
[13:01:37] <kart_>	 Lucas_WMDE: I can deploy if needed.
[13:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:01:59] <wikibugs>	 (03PS1) 10Btullis: Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438)
[13:02:00] <kart_>	 JustHannah and anzx let me know if you need help in deployment
[13:02:13] <JustHannah>	 I need help, please!
[13:02:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069
[13:02:51] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: bump max pod's memory usage for edit check on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184063 (https://phabricator.wikimedia.org/T403423) (owner: 10Elukey)
[13:03:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff)
[13:03:13] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff)
[13:03:15] <anzx>	 kart_: I also need someone to deploy
[13:03:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders)
[13:03:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders)
[13:03:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Apply the durum role on durum3005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (owner: 10Muehlenhoff)
[13:03:37] <JustHannah>	 kart_: +1
[13:03:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82391 and previous config saved to /var/cache/conftool/dbconfig/20250902-130338-fceratto.json
[13:03:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[13:03:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3005.wikimedia.org - jmm@cumin2002"
[13:04:18] <wikibugs>	 (03Merged) 10jenkins-bot: Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders)
[13:04:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[13:04:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders)
[13:04:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[13:04:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11138880 (10ssingh) >>! In T300877#11130890, @ayounsi wrote: >> the idea is that static routes should help save us in that situation >  > That would only...
[13:04:59] <logmsgbot>	 !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]]
[13:05:05] <stashbot>	 T396217: Document that groups with IP reveal rights must not be changed without making changes to the cache for Special:GlobalContributions - https://phabricator.wikimedia.org/T396217
[13:05:05] <stashbot>	 T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181
[13:05:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Right now with maps-test serving all traffic, we have 87 connections, but I fully agree, we have the capacity and let's use it." [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[13:05:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3005.wikimedia.org - jmm@cumin2002"
[13:05:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:05:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh3005.wikimedia.org on all recursors
[13:05:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3005.wikimedia.org on all recursors
[13:05:38] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm
[13:05:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11138894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm
[13:05:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259)
[13:06:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3005.wikimedia.org - jmm@cumin2002"
[13:06:00] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS trixie
[13:06:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3005.wikimedia.org - jmm@cumin2002"
[13:06:08] <wikibugs>	 (03CR) 10Ssingh: Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[13:07:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11138908 (10Jclark-ctr) @elukey Confirmed same issue; connected to iDRAC via SSH tunnel, logged in, and reset BMC under Maintenance → BMC Reset → Selected Unit Reset. i...
[13:07:39] <kart_>	 Sure. I can deploy JustHannah anzx 
[13:07:45] <Emperor>	 !log install libpython3.9-dbg python3.9-dbg on ms-fe2016 for debugging
[13:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh3005.wikimedia.org with OS bookworm
[13:08:30] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host doh3005.wikimedia.org with OS bookworm
[13:09:13] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS trixie
[13:10:08] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11138918 (10MoritzMuehlenhoff)
[13:10:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:10:12] <wikibugs>	 (03PS1) 10Btullis: Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438)
[13:10:12] <wikibugs>	 (03CR) 10Muehlenhoff: Apply the durum role on durum3005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[13:10:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Apply the durum role on durum3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184069 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[13:11:05] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[13:11:47] <logmsgbot>	 !log stran@deploy1003 tchanders, stran: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:49] <wikibugs>	 (03PS1) 10Tiziano Fogli: monitoring services: add migration task T228380 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1184071 (https://phabricator.wikimedia.org/T395443)
[13:11:51] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[13:11:57] <stashbot>	 T396217: Document that groups with IP reveal rights must not be changed without making changes to the cache for Special:GlobalContributions - https://phabricator.wikimedia.org/T396217
[13:11:57] <stashbot>	 T402181: Deploy Temporary accounts to all remaining small-sized projects - https://phabricator.wikimedia.org/T402181
[13:12:22] <Tran>	 Testing my patches now
[13:12:52] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add four hadoop workers from repurposed dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1184068 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[13:13:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: openstack: add wmcs-server-id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi)
[13:15:10] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:15:17] <sukhe>	 ^ expected
[13:16:53] <logmsgbot>	 !log stran@deploy1003 tchanders, stran: Continuing with sync
[13:17:09] <Tran>	 Done testing, finishing sync
[13:18:46] <wikibugs>	 (03PS9) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[13:18:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P82392 and previous config saved to /var/cache/conftool/dbconfig/20250902-131845-fceratto.json
[13:19:54] <wikibugs>	 (03CR) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[13:22:12] <logmsgbot>	 !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154308|Document that IP reveal permissions can't just be reassigned (T396217)]], [[gerrit:1180532|Enable temporary accounts on remaining small-sized projects (T402181)]] (duration: 17m 13s)
[13:23:10] <Tran>	 My deploy is done, thanks for your patience!
[13:23:14] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[13:23:16] <kart_>	 JustHannah: will start with your patch.
[13:23:19] <kart_>	 Tran: Thanks
[13:23:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum)
[13:24:09] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy2003.codfw.wmnet with reason: host reimage
[13:24:12] <JustHannah>	 kart_: okay!
[13:24:48] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgPHPSessionHandling to 'disable' on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum)
[13:25:12] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]]
[13:28:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh3005.wikimedia.org with reason: host reimage
[13:29:16] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T228380 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1184071 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli)
[13:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:29:38] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[13:31:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:31:59] <logmsgbot>	 !log kartik@deploy1003 hokwelum, kartik: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:32:41] <kart_>	 JustHannah: you can test patch now
[13:33:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh3005.wikimedia.org with reason: host reimage
[13:33:12] <JustHannah>	 okay! Thank you!
[13:33:53] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[13:33:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T401906)', diff saved to https://phabricator.wikimedia.org/P82393 and previous config saved to /var/cache/conftool/dbconfig/20250902-133352-fceratto.json
[13:34:08] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[13:34:21] <wikibugs>	 (03PS1) 10Dreamy Jazz: Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111)
[13:34:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[13:35:06] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance
[13:35:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82394 and previous config saved to /var/cache/conftool/dbconfig/20250902-133513-fceratto.json
[13:35:17] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[13:36:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:36:24] <JustHannah>	 kart_: looks good!
[13:36:31] <kart_>	 cool. deploying..
[13:36:35] <logmsgbot>	 !log kartik@deploy1003 hokwelum, kartik: Continuing with sync
[13:36:49] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy2003.codfw.wmnet with reason: host reimage
[13:37:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82395 and previous config saved to /var/cache/conftool/dbconfig/20250902-133736-fceratto.json
[13:38:05] <Lucas_WMDE>	 o/
[13:38:12] <Lucas_WMDE>	 I’d be available now if needed :)
[13:39:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz)
[13:40:37] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[13:42:15] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11139117 (10Gehel) p:05Triage→03High
[13:42:16] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11139118 (10Gehel) p:05Triage→03High
[13:42:26] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11139119 (10Gehel) p:05Triage→03High
[13:42:36] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11139124 (10Gehel) p:05Triage→03High
[13:42:41] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183741|Set $wgPHPSessionHandling to 'disable' on group1 wikis (T362324)]] (duration: 17m 28s)
[13:42:44] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[13:42:55] <kart_>	 JustHannah: done.
[13:43:01] <kart_>	 anzx: your patch is next.
[13:43:01] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886#11139132 (10Gehel)
[13:43:07] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968#11139136 (10Gehel)
[13:43:12] <anzx>	 kart_: ok
[13:43:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11139134 (10Gehel)
[13:43:29] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: PybalBackendDown (instance cirrussearch2091:0) - https://phabricator.wikimedia.org/T399161#11139142 (10Gehel)
[13:43:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: 10Anzx)
[13:44:26] <JustHannah>	 kart_: Thank you so much!
[13:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:44:56] <wikibugs>	 (03Merged) 10jenkins-bot: idwiki: Add extended confirmed usergroup & restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183662 (https://phabricator.wikimedia.org/T402755) (owner: 10Anzx)
[13:45:19] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]]
[13:45:22] <stashbot>	 T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755
[13:45:26] <wikibugs>	 (03PS10) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[13:45:56] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS trixie
[13:46:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[13:47:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh3005.wikimedia.org with OS bookworm
[13:47:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh3005.wikimedia.org
[13:48:12] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11139195 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host doh3005.wikimedia.org with OS bookworm completed: - doh3005...
[13:48:47] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[13:48:55] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82396 and previous config saved to /var/cache/conftool/dbconfig/20250902-134854-ladsgroup.json
[13:48:58] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[13:49:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259)
[13:50:10] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:50:23] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from dumpsdata1004 to an-worker1233
[13:50:43] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[13:50:49] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (3 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[13:50:55] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Remove references to dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/1184070 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis)
[13:52:15] <logmsgbot>	 !log kartik@deploy1003 kartik, anzx: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:52:18] <stashbot>	 T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755
[13:52:19] <anzx>	 kart_: looks good, ok to sync 
[13:52:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P82397 and previous config saved to /var/cache/conftool/dbconfig/20250902-135243-fceratto.json
[13:52:52] <kart_>	 anzx: cool. That's fast.
[13:52:58] <logmsgbot>	 !log kartik@deploy1003 kartik, anzx: Continuing with sync
[13:53:22] <anzx>	 yeah change was working more than two minutes ago 
[13:53:32] <wikibugs>	 (03PS1) 10Bking: refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116)
[13:53:34] <kart_>	 :)
[13:54:51] <wikibugs>	 (03Abandoned) 10Ayounsi: esams routed ganeti: add v4 and v6 IP/range [puppet] - 10https://gerrit.wikimedia.org/r/1180130 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi)
[13:55:19] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking)
[13:55:36] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002"
[13:55:55] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002"
[13:55:56] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy2003.codfw.wmnet with OS bookworm
[13:55:58] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking)
[13:56:03] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm completed: - deploy2003 (**PASS**)   -...
[13:56:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139247 (10Jhancock.wm) 05Open→03Resolved
[13:56:30] <logmsgbot>	 btullis@cumin1003 rename (PID 688424) is awaiting input
[13:56:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565)
[13:57:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11139253 (10Jhancock.wm) @Clement_Goubert @jasmine_ this is complete and ready for y'all!
[13:57:36] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS trixie
[13:57:49] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Looks ok but the amount of corner cases is becoming worrisome" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[13:58:10] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183662|idwiki: Add extended confirmed usergroup & restriction level (T402755)]] (duration: 12m 51s)
[13:58:13] <stashbot>	 T402755: Enable extended confirmed user at Indonesian Wikipedia (id.wp) - https://phabricator.wikimedia.org/T402755
[13:58:37] <Dreamy_Jazz>	 I can self-deploy mine
[13:59:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Update partman config for the new maps nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565)
[13:59:43] <Dreamy_Jazz>	 kart_: Are you going to deploy your change now?
[14:00:04] <jouncebot>	 Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1400)
[14:00:07] <kart_>	 Dreamy_Jazz: yes
[14:00:11] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:00:28] <wikibugs>	 10SRE-swift-storage, 10Observability-Alerting: Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#11139266 (10tappof)
[14:00:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas)
[14:01:13] <wikibugs>	 07Puppet, 10MW-on-K8s, 10Observability-Alerting: Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#11139269 (10tappof)
[14:01:45] <wikibugs>	 (03CR) 10Elukey: "I was about to say that there is still a little nit to solve:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[14:02:20] <wikibugs>	 (03Merged) 10jenkins-bot: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas)
[14:02:24] <anzx>	 kart_: thanks for deploying 
[14:02:44] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1183703|ContentTranslation: Add cxserver host for server-side requests (T386131)]]
[14:02:47] <stashbot>	 T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[14:03:07] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:03:59] <wikibugs>	 (03CR) 10Aklapper: [C:03+2] Remove fallback for Asturian language [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1183280 (https://phabricator.wikimedia.org/T292750) (owner: 10Pppery)
[14:04:17] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Remove fallback for Asturian language [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1183280 (https://phabricator.wikimedia.org/T292750) (owner: 10Pppery)
[14:04:19] <wikibugs>	 (03PS3) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:04:19] <wikibugs>	 (03PS3) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:04:19] <wikibugs>	 (03PS3) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:07:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P82398 and previous config saved to /var/cache/conftool/dbconfig/20250902-140751-fceratto.json
[14:08:42] <wikibugs>	 (03PS4) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:08:42] <wikibugs>	 (03PS4) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:09:17] <XioNoX>	 !log eqsin: remove lvs static routes - T300877
[14:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:20] <stashbot>	 T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877
[14:09:35] <logmsgbot>	 !log kartik@deploy1003 kartik, ngkountas: Backport for [[gerrit:1183703|ContentTranslation: Add cxserver host for server-side requests (T386131)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:09:38] <stashbot>	 T386131: Newly translated sections of articles always placed at the bottom - https://phabricator.wikimedia.org/T386131
[14:10:08] <wikibugs>	 (03PS5) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:10:08] <wikibugs>	 (03PS5) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:12:03] <wikibugs>	 (03PS6) 10Btullis: dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:12:03] <wikibugs>	 (03PS6) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:12:50] <anzx>	 kart_: just one question: even if I delete a saved or in-progress translation, it still appears when I check again.
[14:13:56] <kart_>	 anzx: in CX?
[14:14:10] <anzx>	 yes
[14:14:53] <kart_>	 Need to check. Can you file a task with details?
[14:15:18] <anzx>	 kart_: yes, I'll file one tomorrow, thanks
[14:15:24] <XioNoX>	 !log ulsfo: remove lvs static routes - T300877
[14:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:27] <stashbot>	 T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877
[14:21:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11139452 (10elukey) 05Open→03Resolved @Jclark-ctr confirmed that it works, thanks a lot!
[14:22:02] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:22:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T401906)', diff saved to https://phabricator.wikimedia.org/P82399 and previous config saved to /var/cache/conftool/dbconfig/20250902-142259-fceratto.json
[14:23:03] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:23:07] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Adding also Yiannis: we are going to use raid5 for the new maps nodes, it will give us more space if needed for the future. Raid 5 may be " [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:23:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance
[14:23:16] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184059 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:23:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82400 and previous config saved to /var/cache/conftool/dbconfig/20250902-142322-fceratto.json
[14:23:31] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184060 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:24:43] <wikibugs>	 (03PS1) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184087
[14:25:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82401 and previous config saved to /var/cache/conftool/dbconfig/20250902-142545-fceratto.json
[14:26:00] <wikibugs>	 (03Abandoned) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184087 (owner: 10Mszwarc)
[14:26:26] <XioNoX>	 !log codfw: remove lvs static routes - T300877
[14:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:29] <stashbot>	 T300877: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877
[14:26:39] <wikibugs>	 (03PS5) 10Btullis: dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:27:46] <wikibugs>	 (03PS1) 10Mszwarc: Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089
[14:28:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11139484 (10ayounsi)
[14:28:27] <wikibugs>	 (03CR) 10Bking: [C:03+2] refinery: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1184081 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking)
[14:29:22] <kart_>	 We're still testing the config patch..
[14:30:07] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1430)
[14:31:05] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade the dse-k8s-codfw cluster to version 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1184043 (https://phabricator.wikimedia.org/T396478) (owner: 10Btullis)
[14:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[14:33:00] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[14:34:31] <wikibugs>	 (03PS1) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859)
[14:34:37] <wikibugs>	 (03PS1) 10Federico Ceratto: instances.yaml: Add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1184091 (https://phabricator.wikimedia.org/T402859)
[14:40:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82402 and previous config saved to /var/cache/conftool/dbconfig/20250902-144053-fceratto.json
[14:44:47] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:04] <wikibugs>	 (03CR) 10STran: [C:03+1] Revert "UIC: Avoid fetching revisions from wikis to make list of active wikis" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089 (owner: 10Mszwarc)
[14:47:22] <Dreamy_Jazz>	 kart_: Still testing?
[14:47:45] <kart_>	 Dreamy_Jazz: sadly, yes.
[14:49:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add EFI variant of raid5-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1184082 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:49:55] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:53:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update partman config for the new maps nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1184084 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:53:20] <logmsgbot>	 !log kartik@deploy1003 Sync cancelled.
[14:53:51] <wikibugs>	 (03PS3) 10Cathal Mooney: JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - 10https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577)
[14:53:51] <wikibugs>	 (03PS1) 10KartikMistry: Revert "ContentTranslation: Add cxserver host for server-side requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096
[14:54:12] <wikibugs>	 (03PS2) 10Cathal Mooney: WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577)
[14:54:20] <kart_>	 Dreamy_Jazz: I have to revert as well.
[14:54:25] <Dreamy_Jazz>	 Okay
[14:54:44] <Dreamy_Jazz>	 Though that shouldn't need a full scap backport because the sync never went beyond the test servers
[14:55:02] <Dreamy_Jazz>	 And I'll overwrite what is on those when I deploy my change
[14:55:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096 (owner: 10KartikMistry)
[14:55:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney)
[14:56:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P82403 and previous config saved to /var/cache/conftool/dbconfig/20250902-145601-fceratto.json
[14:56:11] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184061 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene)
[14:56:24] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ContentTranslation: Add cxserver host for server-side requests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184096 (owner: 10KartikMistry)
[14:56:48] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]]
[15:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1500).
[15:03:06] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz)
[15:03:17] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway)
[15:03:38] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:04:01] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[15:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:05:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz)
[15:06:39] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@6e0b4b1]: deploy phab2002 for T403494
[15:06:42] <stashbot>	 T403494: Deploy Phabricator/Phorge 2025-09-02 - https://phabricator.wikimedia.org/T403494
[15:07:22] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@6e0b4b1]: deploy phab2002 for T403494 (duration: 00m 43s)
[15:07:41] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@6e0b4b1]: deploy phab1004 for T403494
[15:08:24] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@6e0b4b1]: deploy phab1004 for T403494 (duration: 00m 43s)
[15:08:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:45] <wikibugs>	 (03PS1) 10Ladsgroup: Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579)
[15:09:31] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184096|Revert "ContentTranslation: Add cxserver host for server-side requests"]] (duration: 12m 42s)
[15:10:15] <wikibugs>	 (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (1 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[15:10:20] <kart_>	 finally.
[15:11:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T401906)', diff saved to https://phabricator.wikimedia.org/P82404 and previous config saved to /var/cache/conftool/dbconfig/20250902-151108-fceratto.json
[15:11:12] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[15:11:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2221.codfw.wmnet with reason: Maintenance
[15:11:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82405 and previous config saved to /var/cache/conftool/dbconfig/20250902-151131-fceratto.json
[15:11:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184078 (https://phabricator.wikimedia.org/T403111) (owner: 10Dreamy Jazz)
[15:11:55] <logmsgbot>	 jmm@cumin2002 reimage (PID 732606) is awaiting input
[15:11:58] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]]
[15:12:01] <stashbot>	 T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111
[15:13:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm
[15:13:22] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11139778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm
[15:13:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82406 and previous config saved to /var/cache/conftool/dbconfig/20250902-151354-fceratto.json
[15:16:12] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:16:55] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[15:18:34] <wikibugs>	 (03CR) 10FNegri: openstack: add wmcs-server-id (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi)
[15:19:18] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1163 gradually with 4 steps - Maint over
[15:20:22] <wikibugs>	 (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup)
[15:21:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: PSU issue on es2055 - https://phabricator.wikimedia.org/T403243#11139863 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated the cable and it's normalized. should be fine and not require any other hands on it.
[15:22:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356#11139866 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:22:02] <wikibugs>	 (03CR) 10FNegri: openstack: add wmcs-server-id (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184040 (https://phabricator.wikimedia.org/T402407) (owner: 10Filippo Giunchedi)
[15:22:24] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184078|Add the CheckUserMatchSuggestedInvestigationsSignalAgainstUser hook (T403111)]] (duration: 10m 25s)
[15:22:27] <stashbot>	 T403111: Suggested investigations: Define hooks to be used by private signal logic to define and implement a signal - https://phabricator.wikimedia.org/T403111
[15:26:39] <wikibugs>	 06SRE, 10SRE-Access-Requests: Update SSH key for Connie Chen - https://phabricator.wikimedia.org/T403242#11139889 (10cchen) Thank you @JMeybohm!
[15:29:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82408 and previous config saved to /var/cache/conftool/dbconfig/20250902-152902-fceratto.json
[15:29:36] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:55] <wikibugs>	 (03CR) 10David Caro: [C:03+2] "Lets give this a try, I'll remove the other projects if they spam too much." [alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David Caro)
[15:31:26] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs: add object storage quota alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David Caro)
[15:33:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage
[15:33:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Also update partman recipe for new maps/eqiad nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184102 (https://phabricator.wikimedia.org/T381565)
[15:38:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage
[15:39:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Also update partman recipe for new maps/eqiad nodes [puppet] - 10https://gerrit.wikimedia.org/r/1184102 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[15:42:10] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-ats (exit_code=0) rolling restart_daemons on A:cp
[15:44:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P82410 and previous config saved to /var/cache/conftool/dbconfig/20250902-154409-fceratto.json
[15:45:25] <wikibugs>	 (03PS3) 10Cathal Mooney: WMF-Plugin: Include the BGP role when exposing the IGBP data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577)
[15:50:54] <wikibugs>	 (03PS1) 10Nik Gkountas: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131)
[15:51:15] <wikibugs>	 (03PS1) 10Jdlrobson: Send email alerts to Reading Web Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113
[15:52:37] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[15:52:41] <wikibugs>	 (03PS2) 10Jdlrobson: Send email alerts to Reading Web Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298)
[15:52:44] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Remove ncredir3003 [puppet] - 10https://gerrit.wikimedia.org/r/1184062 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[15:53:21] <wikibugs>	 (03PS1) 10DLynch: Edit check: set up the tone check a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231)
[15:55:52] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus: Stop using auto_expand_replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627)
[15:55:56] <wikibugs>	 (03CR) 10KartikMistry: [C:03+1] ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas)
[15:56:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm
[15:56:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Apply the wikidough role on doh3005 [puppet] - 10https://gerrit.wikimedia.org/r/1184080 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff)
[15:56:33] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm completed: - maps2011 (**PASS**)   - Downt...
[15:59:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T401906)', diff saved to https://phabricator.wikimedia.org/P82412 and previous config saved to /var/cache/conftool/dbconfig/20250902-155918-fceratto.json
[15:59:25] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[15:59:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2222.codfw.wmnet with reason: Maintenance
[15:59:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82413 and previous config saved to /var/cache/conftool/dbconfig/20250902-155942-fceratto.json
[16:00:05] <jouncebot>	 jhathaway and moritzm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch)
[16:00:52] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116
[16:01:49] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116
[16:01:54] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595)
[16:02:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82414 and previous config saved to /var/cache/conftool/dbconfig/20250902-160204-fceratto.json
[16:02:23] <wikibugs>	 (03CR) 10Sbisson: [C:03+1] ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas)
[16:03:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[16:04:45] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1163 gradually with 4 steps - Maint over
[16:05:46] <wikibugs>	 (03PS2) 10Krinkle: Disable wmgUseMdotRouting on testwiki in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183700 (https://phabricator.wikimedia.org/T401595)
[16:05:46] <wikibugs>	 (03PS2) 10Krinkle: Disable wmgUseMdotRouting on Test Wikidata, Wikitech, and Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184117 (https://phabricator.wikimedia.org/T401595)
[16:06:52] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140060 (10Papaul) I took a look at the node; we do have a backplane issue, see error below. The server is not coming up after a reboot.  ` The System Configuration Check operation result...
[16:09:57] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140073 (10RobH) a:03RobH So that means a bad backplane or mainboard (likely backplane).  I'll steal this task and open a support ticket to have a tech dispatched with a replacement part.
[16:09:59] <Emperor>	 !oncall-now
[16:09:59] <sirenbot>	 Oncall now for team SRE, rotation business_hours:
[16:09:59] <sirenbot>	 m.utante, u.random
[16:13:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm
[16:13:48] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm
[16:16:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 (owner: 10Ayounsi)
[16:16:25] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:17:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82416 and previous config saved to /var/cache/conftool/dbconfig/20250902-161711-fceratto.json
[16:19:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11140117 (10VRiley-WMF) I created an account at Juniper, tried to open a support case for it for me to get added, however I was unable to do that. Notified @RobH and he said he'd look into it.
[16:22:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11140127 (10ssingh) Thanks for taking care of this @ayounsi! We will update this task when we are ready to remove the `eqiad` ones.
[16:24:48] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Revert "Remove magru RIPE Atlas Anchor" [puppet] - 10https://gerrit.wikimedia.org/r/1184116 (owner: 10Ayounsi)
[16:25:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:27:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:29:53] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2042.codfw.wmnet
[16:29:56] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp2042.codfw.wmnet
[16:32:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P82417 and previous config saved to /var/cache/conftool/dbconfig/20250902-163219-fceratto.json
[16:33:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage
[16:34:15] <wikibugs>	 (03PS1) 10DLynch: Edit check: deploy tone a/b test to frwiki, jawiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184120 (https://phabricator.wikimedia.org/T389231)
[16:36:31] <wikibugs>	 (03Abandoned) 10Ebernhardson: cirrus: Stop using auto_expand_replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) (owner: 10Ebernhardson)
[16:38:35] <wikibugs>	 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11140228 (10xcollazo) CC @BTullis
[16:38:36] <wikibugs>	 (03PS2) 10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952)
[16:38:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch)
[16:38:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage
[16:41:55] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82418 and previous config saved to /var/cache/conftool/dbconfig/20250902-164155-ladsgroup.json
[16:41:59] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[16:43:03] <wikibugs>	 (03PS1) 10Sbisson: CxServerClient: Log url instead of relative path upon failure [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131)
[16:44:03] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms
[16:44:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ContentTranslation] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184122 (https://phabricator.wikimedia.org/T386131) (owner: 10Sbisson)
[16:44:56] <wikibugs>	 (03PS3) 10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952)
[16:45:03] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11140268 (10Jhancock.wm) connected. using the serial connection for the ps1-b7-codfw temporarily. if we need a more permanent line, lmk and i can run it.
[16:47:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T401906)', diff saved to https://phabricator.wikimedia.org/P82419 and previous config saved to /var/cache/conftool/dbconfig/20250902-164727-fceratto.json
[16:47:31] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[16:49:43] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:51:33] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11140320 (10Jhancock.wm) hey! i got a thing mixed up but everything is good now. my bad. please let me know if you need anything else!
[16:51:50] <wikibugs>	 (03PS1) 10DLynch: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952)
[16:56:49] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11140332 (10Jgreen) >>! In T400275#11140320, @Jhancock.wm wrote: > hey! i got a thing mixed up but everything is good now. my bad. please let me know if you need anything else!  Co...
[16:57:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82420 and previous config saved to /var/cache/conftool/dbconfig/20250902-165702-ladsgroup.json
[16:57:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps2011.codfw.wmnet with OS bookworm
[16:58:26] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm completed: - maps2011 (**PASS**)   - Downt...
[16:58:29] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11140334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps2011.codfw.wmnet with OS bookworm executed with errors: - maps2011 (**FAIL**...
[16:58:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184112 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas)
[16:58:52] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11140340 (10phaultfinder)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1700)
[17:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:59] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11140361 (10phaultfinder)
[17:04:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:07:34] <wikibugs>	 (03PS1) 10Jasmine: switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874)
[17:09:33] <wikibugs>	 (03CR) 10VolkerE: Update vector search config with new wgVectorTypeahead (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang)
[17:10:13] <wikibugs>	 (03CR) 10Herron: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[17:12:11] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P82421 and previous config saved to /var/cache/conftool/dbconfig/20250902-171210-ladsgroup.json
[17:13:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005']
[17:13:40] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[17:13:50] <wikibugs>	 (03CR) 10Herron: [C:03+2] profile::pyrra::filesystem::slo: add new slo define [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron)
[17:14:07] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ganeti3005']
[17:14:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch)
[17:14:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184124 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch)
[17:14:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) (owner: 10Jdlrobson)
[17:15:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti3005']
[17:16:00] <wikibugs>	 (03CR) 10VolkerE: Remove deprecated search config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: 10Bernard Wang)
[17:19:56] <logmsgbot>	 robh@cumin2002 upgrade-firmware (PID 799202) is awaiting input
[17:20:07] <wikibugs>	 (03CR) 10VolkerE: [C:04-1] Send email alerts to Reading Web Slack channel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson)
[17:27:18] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82422 and previous config saved to /var/cache/conftool/dbconfig/20250902-172718-ladsgroup.json
[17:27:22] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[17:27:34] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[17:27:42] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82423 and previous config saved to /var/cache/conftool/dbconfig/20250902-172741-ladsgroup.json
[17:28:21] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti3005']
[17:28:49] <wikibugs>	 (03PS3) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298)
[17:29:16] <wikibugs>	 (03PS4) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298)
[17:29:33] <wikibugs>	 (03PS5) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298)
[17:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:29:39] <wikibugs>	 (03CR) 10Jdlrobson: Send email alerts to Reading Web "Performance Alert" Slack channel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: 10Jdlrobson)
[17:30:06] <wikibugs>	 (03PS1) 10Bking: stat hosts: alert on I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1184128 (https://phabricator.wikimedia.org/T401589)
[17:36:37] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11140566 (10Krinkle)
[17:38:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: 10Ebernhardson)
[17:39:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:40:36] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11140589 (10RobH) a:05RobH→03MoritzMuehlenhoff After updating the idrac, bios, and backplane firmware and resetting & then allowing the system to post a few times, it hasn't shown the...
[17:40:55] <wikibugs>	 (03PS3) 10Herron: pyrra: citoid enable revision param [puppet] - 10https://gerrit.wikimedia.org/r/1182898 (https://phabricator.wikimedia.org/T400073)
[17:41:07] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified routing on test.wikidata, wikitech, officewiki [puppet] - 10https://gerrit.wikimedia.org/r/1184126 (https://phabricator.wikimedia.org/T401595)
[17:41:07] <wikibugs>	 (03PS1) 10Krinkle: varnish: Enable unified routing on mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1184130 (https://phabricator.wikimedia.org/T403510)
[17:43:01] <wikibugs>	 (03PS1) 10Krinkle: Disable wmgUseMdotRouting on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184131 (https://phabricator.wikimedia.org/T403510)
[17:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:46:24] <wikibugs>	 (03CR) 10Herron: [C:03+2] pyrra: citoid enable revision param [puppet] - 10https://gerrit.wikimedia.org/r/1182898 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron)
[17:51:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:51:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[18:00:05] <jouncebot>	 dancy and andre: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1800).
[18:00:14] <dancy>	 o/
[18:00:25] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:02:17] <dancy>	 Pressing the button
[18:02:33] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378)
[18:02:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot)
[18:03:27] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184132 (https://phabricator.wikimedia.org/T396378) (owner: 10TrainBranchBot)
[18:04:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:06:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:06:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:00] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.17  refs T396378
[18:15:04] <stashbot>	 T396378: 1.45.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T396378
[18:16:55] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11140708 (10Krinkle)
[18:19:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:21:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:31:14] <Amir1>	 jouncebot: nowandnext
[18:31:14] <jouncebot>	 For the next 1 hour(s) and 28 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T1800)
[18:31:14] <jouncebot>	 In 1 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000)
[18:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[18:33:09] <Amir1>	 dancy: hiiiii, sorry to bother, when you're done with the deploy and nothing is needed and all okay (no rush, totally). Would you mind giving me a heads up so I quickly deploy something?
[18:33:19] <dancy>	 Amir1: All yours!
[18:33:30] <Amir1>	 oh nice
[18:33:39] <Amir1>	 party time
[18:33:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup)
[18:34:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup)
[18:34:53] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to categorylinks old in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184097 (https://phabricator.wikimedia.org/T399579) (owner: 10Ladsgroup)
[18:35:16] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]]
[18:35:19] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[18:39:22] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:41:51] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[18:44:41] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:47:14] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184097|Stop writing to categorylinks old in enwiki (T399579)]] (duration: 11m 57s)
[18:47:17] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[18:49:55] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:54:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CheckUser] (wmf/1.45.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1184089 (owner: 10Mszwarc)
[18:57:40] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "if you search for es2049 in icinga.wikimedia.org, there is a massive disk space warning: https://icinga.wikimedia.org/cgi-bin/icinga/extin" [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[18:58:06] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "since in es2026 disk usage is 28% and in es2049 it's 95%" [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[18:58:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11140824 (10Ottomata) Approved!
[19:01:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:06] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:06:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:09:13] <wikibugs>	 (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:12:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:13:24] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:19:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:21:35] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:23:51] <wikibugs>	 (03PS11) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[19:24:39] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:28:59] <wikibugs>	 (03CR) 10Elukey: [C:04-1] sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:31:06] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:31:40] <wikibugs>	 (03PS12) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[19:31:55] <wikibugs>	 (03PS13) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[19:32:20] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:32:31] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:32:51] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:33:03] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[19:33:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:36:48] <wikibugs>	 (03CR) 10Elukey: "Still see:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:37:47] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:38:04] <wikibugs>	 (03CR) 10JHathaway: sre.hosts.provision: update cookbook for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[19:39:23] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:41:25] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:45:12] <wikibugs>	 (03PS1) 10Jforrester: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153
[19:47:15] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11141081 (10Jhancock.wm) i set it to the one i have that starts with a T. I can set it to something else if that one doesn't work for you, or you aren't sure which i'm talking about.
[19:48:00] <wikibugs>	 (03PS14) 10Elukey: WIP - sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[19:48:07] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:49:00] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:52:25] <logmsgbot>	 elukey@cumin1003 provision (PID 726973) is awaiting input
[19:56:36] <wikibugs>	 (03CR) 10Ottomata: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[19:58:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:59:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000)
[20:00:05] <jouncebot>	 danisztls, kemayo, stephanebisson, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <danisztls>	 o/ I can self-deploy
[20:00:17] <Kemayo>	 o/ as can I
[20:00:42] <wikibugs>	 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11141178 (10ecarg) 05Open→03Resolved a:03ecarg Marking this as 'Resolved' because the inaugural board is s...
[20:00:44] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11141181 (10Jgreen) >>! In T400275#11141081, @Jhancock.wm wrote: > i set it to the one i have that starts with a T. I can set it to something else if that one doesn't work for you,...
[20:01:19] <wikibugs>	 (03PS15) 10Elukey: WIP - sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[20:01:22] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:01:32] <danisztls>	 I will start deploying my patch to not keep everyone waiting
[20:01:33] <wikibugs>	 (03PS2) 10Jforrester: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184153 (https://phabricator.wikimedia.org/T107188)
[20:01:34] <Kemayo>	 danisztls: you've just got one, so want to go first?
[20:01:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:01:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza)
[20:02:10] <danisztls>	 Kemayo: yep
[20:02:43] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy Newcomers survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183753 (https://phabricator.wikimedia.org/T402915) (owner: 10DDesouza)
[20:03:08] <logmsgbot>	 !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]]
[20:03:11] <stashbot>	 T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915
[20:06:40] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:07:40] <wikibugs>	 (03PS16) 10Elukey: sre.hosts.provision: update cookbook for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851)
[20:08:31] <wikibugs>	 (03CR) 10Elukey: "Worked on cp2043, I'll try on other nodes too. Let me know if the code is sound, I had to add another workaround for a weird use case in t" [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey)
[20:08:47] <stephanebisson>	 jouncebot now
[20:08:48] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2000)
[20:09:14] <logmsgbot>	 !log dani@deploy1003 dani: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:09:17] <stashbot>	 T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915
[20:09:46] <logmsgbot>	 !log dani@deploy1003 dani: Continuing with sync
[20:10:16] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546)
[20:10:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[20:12:08] <wikibugs>	 (03CR) 10Scott French: [C:03+1] switchdc: remove mw-wikifunctions discovery services following move to k8s ingress [cookbooks] - 10https://gerrit.wikimedia.org/r/1184125 (https://phabricator.wikimedia.org/T397874) (owner: 10Jasmine)
[20:12:14] <wikibugs>	 (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184155 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[20:12:40] <wikibugs>	 (03PS1) 10BryanDavis: hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157
[20:12:40] <wikibugs>	 (03PS1) 10BryanDavis: hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158
[20:12:58] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546)
[20:13:35] <wikibugs>	 (03PS1) 10DDesouza: Fix typo on newcomers survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184160 (https://phabricator.wikimedia.org/T402915)
[20:15:02] <logmsgbot>	 !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183753|Pre-deploy Newcomers survey on enwiki (T402915)]] (duration: 11m 53s)
[20:15:05] <stashbot>	 T402915: Newcomer survey: first test, then launch a quicksurvey - https://phabricator.wikimedia.org/T402915
[20:15:32] <danisztls>	 Kemayo: all yours
[20:15:42] <Kemayo>	 danisztls: thanks
[20:16:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch)
[20:16:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch)
[20:17:06] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82426 and previous config saved to /var/cache/conftool/dbconfig/20250902-201705-ladsgroup.json
[20:17:09] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[20:18:04] <wikibugs>	 (03Merged) 10jenkins-bot: Edit check: set up the tone check a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184115 (https://phabricator.wikimedia.org/T389231) (owner: 10DLynch)
[20:18:06] <wikibugs>	 (03Merged) 10jenkins-bot: Edit check: log to VEFU if a tone check would have been shown if not for the a/b test [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1184121 (https://phabricator.wikimedia.org/T394952) (owner: 10DLynch)
[20:18:36] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]]
[20:18:42] <stashbot>	 T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231
[20:18:43] <stashbot>	 T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195
[20:18:43] <stashbot>	 T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952
[20:19:06] <wikibugs>	 (03PS1) 10Lucas Werkmeister: Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324)
[20:19:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister)
[20:19:58] <lucaswerkmeister>	 ^ if we have time, I’d love to get this deployed (cc MatmaRex)
[20:20:14] <MatmaRex>	 👍
[20:24:49] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:55] <stashbot>	 T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231
[20:24:55] <stashbot>	 T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195
[20:24:56] <stashbot>	 T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952
[20:25:37] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Continuing with sync
[20:25:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:27:06] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:27:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11141418 (10Jhancock.wm) Hi, checking is to see if I can remove the ops-codfw tag? I'm cleaning up our board. Are you using the tag to organize in some way...
[20:28:22] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister)
[20:29:22] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: Respond with HTTP 405 to disallowed methods [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis)
[20:30:27] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: Redirect / to mw.o project page [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis)
[20:31:11] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184115|Edit check: set up the tone check a/b test (T389231 T402195)]], [[gerrit:1184121|Edit check: log to VEFU if a tone check would have been shown if not for the a/b test (T394952)]] (duration: 12m 34s)
[20:31:17] <stashbot>	 T389231: Deploy config change to start the Tone Check A/B Test - https://phabricator.wikimedia.org/T389231
[20:31:17] <stashbot>	 T402195: Improve edit check a/b test configuration to cope with multiple tests running side by side - https://phabricator.wikimedia.org/T402195
[20:31:17] <stashbot>	 T394952: Log edits when Tone Check would've been shown had someone not been in control group - https://phabricator.wikimedia.org/T394952
[20:31:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11141430 (10wiki_willy) a:03Jclark-ctr
[20:31:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) (owner: 10Jdlrobson)
[20:32:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P82427 and previous config saved to /var/cache/conftool/dbconfig/20250902-203212-ladsgroup.json
[20:34:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11141445 (10wiki_willy) a:03VRiley-WMF
[20:35:31] <wikibugs>	 (03PS2) 10JHathaway: acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858)
[20:35:53] <wikibugs>	 (03CR) 10JHathaway: "good idea, added" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway)
[20:36:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway)
[20:37:23] <wikibugs>	 (03PS3) 10JHathaway: acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858)
[20:38:25] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway)
[20:45:15] <wikibugs>	 (03Merged) 10jenkins-bot: Restore ext.visualEditor.track module [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) (owner: 10Jdlrobson)
[20:45:39] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]]
[20:45:44] <stashbot>	 T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127
[20:47:23] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P82428 and previous config saved to /var/cache/conftool/dbconfig/20250902-204722-ladsgroup.json
[20:51:31] <logmsgbot>	 !log kemayo@deploy1003 jdlrobson, kemayo: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:51:34] <stashbot>	 T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127
[20:52:43] <logmsgbot>	 !log kemayo@deploy1003 jdlrobson, kemayo: Continuing with sync
[20:57:54] <stephanebisson>	 jouncebot next
[20:57:54] <jouncebot>	 In 0 hour(s) and 2 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100)
[20:57:59] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183154|Restore ext.visualEditor.track module (T403127)]] (duration: 12m 20s)
[20:58:02] <stashbot>	 T403127: VisualEditor is loading oojs-ui on desktop page load - https://phabricator.wikimedia.org/T403127
[20:58:25] <Kemayo>	 Technically I have one more I could deploy, but if someone else wants to get something in then I don't mind.
[20:58:36] <Kemayo>	 Web doesn't exist any more, after all, so that window should be free.
[20:58:46] <lucaswerkmeister>	 I would love to get my config change deployed, it hopefully fixes a regression in several tools
[20:58:59] <Kemayo>	 Go for it.
[20:59:20] <stephanebisson>	 Mine can be done a little later but I would love to squeeze it in the next hour
[20:59:35] <jan_drewniak>	 Kemayo: o/ I'm planning on using the web deployment window today but y'all can finish any backports first.
[20:59:51] <Lucas_WMDE>	 ok I guess I’ll deploy “my” config change then
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100)
[21:00:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister)
[21:00:32] * lucaswerkmeister tries to put together an X-Wikimedia-Debug compatible test in the meantime
[21:01:11] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184161 (https://phabricator.wikimedia.org/T362324) (owner: 10Lucas Werkmeister)
[21:01:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]]
[21:01:42] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[21:01:42] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[21:02:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T402925)', diff saved to https://phabricator.wikimedia.org/P82429 and previous config saved to /var/cache/conftool/dbconfig/20250902-210229-ladsgroup.json
[21:02:33] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[21:02:35] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[21:02:52] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[21:03:00] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82430 and previous config saved to /var/cache/conftool/dbconfig/20250902-210259-ladsgroup.json
[21:03:53] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11141597 (10phaultfinder)
[21:06:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, lucaswerkmeister: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:06:24] <lucaswerkmeister>	 testing
[21:06:47] <lucaswerkmeister>	 hm, nothing so far… let me try logging out and back in
[21:06:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:08:52] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11141606 (10phaultfinder)
[21:08:56] <lucaswerkmeister>	 still nothing
[21:09:05] <lucaswerkmeister>	 but I’m not confident I’m sending the XWD header correctly
[21:09:25] <lucaswerkmeister>	 so I think I’ll go ahead with the deployment anyway
[21:09:30] <Lucas_WMDE>	 ok
[21:09:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, lucaswerkmeister: Continuing with sync
[21:09:44] <MatmaRex>	 :D
[21:10:34] <lucaswerkmeister>	 (I put `self.session.headers['X-Wikimedia-Debug'] = 'backend=k8s-mwdebug'` in the Runner’s __post_init__() fwiw)
[21:10:58] <lucaswerkmeister>	 (I don’t think I can easily get mwapi to show me the response headers that would indicate the actual server)
[21:11:03] <Jdlrobson>	 jan_drewniak: there are no deployments after ours so I guess we can go late if needed. You want to meet early?
[21:13:08] <MatmaRex>	 lucaswerkmeister: there is also &servedby=1 which will add a field to the API response body
[21:13:16] <lucaswerkmeister>	 ooh
[21:13:50] <lucaswerkmeister>	 'servedby': 'mw-debug.eqiad.pinkunicorn-7447bd958c-k6bw8'
[21:13:51] <lucaswerkmeister>	 hm
[21:13:57] <lucaswerkmeister>	 sounds like it doesn’t fix the issue then 😔
[21:14:10] <lucaswerkmeister>	 (thanks anyway, I’ll try to remember that parameter ^^)
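The check discussed above (send the X-Wikimedia-Debug header, then confirm the backend via `servedby=1`) could be sketched roughly as follows. This is a minimal illustration, not the code used in the tool: `build_debug_request` is a hypothetical helper, the testwiki URL and the query parameters are assumptions, and it only builds the URL and headers without sending anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_debug_request(api_url, params, backend="k8s-mwdebug"):
    """Return (url, headers) for an API call routed to a debug backend.

    Hypothetical helper: the header value mirrors the one quoted in the
    chat; servedby=1 asks the API to report the serving host in the
    response body.
    """
    headers = {"X-Wikimedia-Debug": f"backend={backend}"}
    query = {**params, "servedby": "1", "format": "json"}
    return f"{api_url}?{urlencode(query)}", headers


# Assumed endpoint and parameters, for illustration only.
url, headers = build_debug_request(
    "https://test.wikipedia.org/w/api.php",
    {"action": "query", "meta": "tokens"},
)
```

If the response's `servedby` field names an mw-debug host, the header was applied; otherwise the request went to a regular backend.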
[21:14:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:44] <jan_drewniak>	 Jdlrobson: I have a portal banner to deploy after both Lucas_WMDE and lucaswerkmeister are done with their deployment :P we can meet once I get that started.
[21:14:57] <Lucas_WMDE>	 should be done in a moment :P
[21:15:00] <MatmaRex>	 no problem, we can reenable that once we figure out the current issue
[21:15:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184161|Revert "Set $wgPHPSessionHandling to 'disable' on group1 wikis" (T362324 T403519)]] (duration: 13m 27s)
[21:15:04] <lucaswerkmeister>	 ^ what Lucas_WMDE said
[21:15:08] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[21:15:08] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[21:15:10] * Lucas_WMDE done deploying
[21:15:23] <Lucas_WMDE>	 (since MatmaRex doesn’t need an immediate revert)
[21:15:26] <Lucas_WMDE>	 jan_drewniak: over to you
[21:15:35] <jan_drewniak>	 thank you!
[21:15:44] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[21:15:46] <MatmaRex>	 we've been working on session handling code recently, one of the changes must have affected this somehow
[21:16:11] <wikibugs>	 (03CR) 10Dr0ptp4kt: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez)
[21:16:30] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184159 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[21:16:39] <lucaswerkmeister>	 but it’s strange that it happens on wikidatawiki (group1) and not enwiki (group2) today, since they’re on the same train version
[21:17:21] <lucaswerkmeister>	 MatmaRex: plot twist, Mahir256 reports something started working again
[21:17:25] * lucaswerkmeister tests my tools some more
[21:18:14] <lucaswerkmeister>	 ok “real” QuickCategories also works again
[21:18:15] <MatmaRex>	 if this is related to sessions, it might help to "log out" your tools and log in again
[21:18:18] <lucaswerkmeister>	 no idea why my localhost test still has the issue
[21:18:36] <lucaswerkmeister>	 I tried to completely log out during the mwdebug phase (Special:UserLogout and revoke on Special:OAuthManageMyGrants)
[21:18:51] <lucaswerkmeister>	 (and discard the session in the tool to repeat the OAuth authorization)
[21:19:58] <lucaswerkmeister>	 I mean, I’m happy my tool is alive again, I guess :D
[21:20:22] <lucaswerkmeister>	 guess we’re somehow still using PHP session handling after all (perhaps only in OAuth?)
[21:21:48] <MatmaRex>	 these are all oauth 1 apps?
[21:21:57] <lucaswerkmeister>	 wait, it’s obvious. my localhost test is against testwiki
[21:22:00] <lucaswerkmeister>	 i.e. group0
[21:22:03] <lucaswerkmeister>	 that revert only fixed group1 :D
[21:22:29] <lucaswerkmeister>	 MatmaRex: as far as I’m aware at least, though I’m not sure about all of them (e.g. https://github.com/maxlath/wikibase-cli/issues/192 doesn’t say which oauth version or consumer)
[21:23:59] <MatmaRex>	 lucaswerkmeister: in some unit tests we've seen cases where PHPSessionHandler magically transported values between different instances of WebRequest / mocked SessionManager. i wonder if in some cases this also happens in real code
[21:24:28] <lucaswerkmeister>	 jouncebot: nowandnext
[21:24:28] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250902T2100)
[21:24:28] <jouncebot>	 In 8 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250903T0600)
[21:24:32] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 9.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:25:15] <lucaswerkmeister>	 jan_drewniak: can you ping me when you’re done? I’d like to do another revert (but it’s a bit less urgent)
[21:25:32] <lucaswerkmeister>	 also stephanebisson ebernhardson, did you still want to deploy? sorry for butting in before you
[21:26:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:27:43] <stephanebisson>	 No worries, I'll reschedule.
[21:28:02] <MatmaRex>	 lucaswerkmeister: you want to revert it on group0 as well?
[21:28:11] <lucaswerkmeister>	 I’d do that, yeah
[21:28:15] <lucaswerkmeister>	 unless you’re against it?
[21:28:21] <lucaswerkmeister>	 but there’s plenty of non-test wikis in that group
[21:28:34] <MatmaRex>	 no, i think reverting is okay
[21:28:39] <MatmaRex>	 just leave us testwiki for testing :)
[21:28:41] <lucaswerkmeister>	 we could leave it on on testwiki if that helps
[21:28:43] <lucaswerkmeister>	 jinx ^^
[21:28:52] <lucaswerkmeister>	 I’ll just need to retarget my test for test2wiki then ^^
[21:29:00] <lucaswerkmeister>	 ok lemme put together the config change
[21:29:33] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 11m 18s)
[21:29:36] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:29:37] <lucaswerkmeister>	 ah, a plain revert already leaves it disabled on testwiki, I don’t even need to change stuff
[21:29:38] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[21:29:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:30:24] <wikibugs>	 (PS1) Lucas Werkmeister: Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324)
[21:31:11] <lucaswerkmeister>	 dangit, test2wiki is in group1, I’ll need to test on another group0 wiki
[21:31:34] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1184159| Bumping portals to master (T128546)]] (duration: 01m 59s)
[21:32:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:33:08] <MatmaRex>	 lucaswerkmeister: i'll be off for tonight, i'll try to investigate this tomorrow (or maybe someone else will, i'll write in our team channel). to reproduce this, i should be able to use QuickCategories against test.wikipedia.org? i've never seen that tool before, but i can probably figure it out
[21:33:19] <lucaswerkmeister>	 yes
[21:33:31] <lucaswerkmeister>	 input could look like:
[21:33:31] <lucaswerkmeister>	 User:Lucas Werkmeister/sandbox|+Category:Testing T403519
[21:33:33] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[21:33:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:01] <lucaswerkmeister>	 for testing, please don’t use “background” mode, as that would then break the background runner across the whole tool :)
[21:34:02] <wikibugs>	 (PS3) Jdlrobson: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1182944
[21:34:14] <lucaswerkmeister>	 so use the “Run these commands” button instead (that’s foreground mode)
[21:36:29] <MatmaRex>	 lucaswerkmeister: mind if i copy-paste to the task?
[21:36:35] <lucaswerkmeister>	 sure
[21:36:42] <lucaswerkmeister>	 I was also thinking of leaving a comment to that effect later
[21:36:45] <lucaswerkmeister>	 I can also do it now while I wait ^^
[21:37:03] <MatmaRex>	 oh, yeah, please do. thank you
[21:37:12] <MatmaRex>	 and sorry for breaking it D:
[21:39:58] <Jdlrobson>	 lucaswerkmeister: i just need to do one deploy in our window then can pass back to you
[21:40:03] <lucaswerkmeister>	 ok
[21:40:33] <Lucas_WMDE>	 MatmaRex: just before you leave – I’m on holiday from Thursday, so I won’t be able to run the maintenance script from T398177 for you (it wouldn’t finish in time)
[21:40:34] <stashbot>	 T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177
[21:40:43] <Lucas_WMDE>	 just in case you were hoping to start running that tomorrow :)
[21:40:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: Bernard Wang)
[21:41:14] <Lucas_WMDE>	 (you can still run it, you just need to find someone else who’ll still be around when it finishes ^^)
[21:41:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:41:43] <wikibugs>	 (Merged) jenkins-bot: Remove deprecated search config [mediawiki-config] - https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: Bernard Wang)
[21:42:08] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]]
[21:42:12] <stashbot>	 T402208: Remove old search config - https://phabricator.wikimedia.org/T402208
[21:43:43] <MatmaRex>	 Lucas_WMDE: sure, no problem
[21:44:36] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:48:04] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson, bwang: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:48:08] <stashbot>	 T402208: Remove old search config - https://phabricator.wikimedia.org/T402208
[21:48:58] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson, bwang: Continuing with sync
[21:50:31] <Jdlrobson>	 Lucas_WMDE: over to you
[21:51:21] <Lucas_WMDE>	 thanks!
[21:51:35] <Lucas_WMDE>	 hang on, spiderpig’s still running ^^
[21:51:40] <Lucas_WMDE>	 I guess I’ll go as soon as it’s done
[21:54:14] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182875|Remove deprecated search config (T402208)]] (duration: 12m 06s)
[21:54:18] <stashbot>	 T402208: Remove old search config - https://phabricator.wikimedia.org/T402208
[21:54:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324) (owner: Lucas Werkmeister)
[21:54:39] <Lucas_WMDE>	 let’s go
[21:55:29] <wikibugs>	 (Merged) jenkins-bot: Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1184166 (https://phabricator.wikimedia.org/T362324) (owner: Lucas Werkmeister)
[21:55:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]]
[21:55:59] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[21:55:59] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[21:58:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:00:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister, lucaswerkmeister-wmde: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:00:25] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:00:33] <lucaswerkmeister>	 woooh https://test.wikidata.org/w/index.php?oldid=737809&diff=737810
[22:00:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister, lucaswerkmeister-wmde: Continuing with sync
[22:01:16] <wikibugs>	 (PS1) Cwhite: airflow: disable icinga nrpe checks [puppet] - https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214)
[22:01:18] <wikibugs>	 (PS1) Cwhite: hiera: disable monitoring for legacy profile::airflow::instances [puppet] - https://gerrit.wikimedia.org/r/1184170 (https://phabricator.wikimedia.org/T384214)
[22:01:20] <wikibugs>	 (PS1) Cwhite: airflow: remove nrpe definitions [puppet] - https://gerrit.wikimedia.org/r/1184171 (https://phabricator.wikimedia.org/T384214)
[22:01:43] <wikibugs>	 (CR) CI reject: [V:-1] airflow: disable icinga nrpe checks [puppet] - https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214) (owner: Cwhite)
[22:03:01] <wikibugs>	 (PS2) Cwhite: airflow: disable icinga nrpe checks [puppet] - https://gerrit.wikimedia.org/r/1184169 (https://phabricator.wikimedia.org/T384214)
[22:04:58] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:05:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1184166|Revert "Set $wgPHPSessionHandling to 'disable' on group0 wikis" (T362324 T403519)]] (duration: 10m 03s)
[22:06:02] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[22:06:02] <stashbot>	 T403519: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519
[22:06:09] * Lucas_WMDE done deploying
[22:06:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:06:40] <Lucas_WMDE>	 !log UTC late backport+config window (belatedly) done
[22:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:45] <perryprog>	 I'm not saying I did just now get an "invalid response from server" on trying to save an edit like seconds ago when that deploy would've finished, but I maybe sorta did
[22:06:49] <perryprog>	 it went away on reload and trying again tho
[22:08:54] <icinga-wm>	 PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x1 at eqiad (db1216) taken on 2025-09-02 21:37:14 is 308 GiB, but the previous one was 386 GiB, a change of -20.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[22:09:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:11:24] <lucaswerkmeister>	 hmmmm
[22:11:27] <lucaswerkmeister>	 that’s definitely not troubling at all
[22:12:29] <perryprog>	 definitely not
[22:14:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:17:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:28:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:30:46] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:31:18] <lucaswerkmeister>	 perryprog: any further errors?
[22:31:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:31:51] <perryprog>	 Nothing; I think you're safe
[22:32:03] <perryprog>	 maybe ant got in my computer or something
[22:32:13] <lucaswerkmeister>	 phew, thanks ^^
[22:32:39] <Lucas_WMDE>	 logspam-watch looks mostly okay fwiw, though there’s a spike of Stats: Label value cannot be empty.
[22:32:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[22:33:17] <Lucas_WMDE>	 though it looks like that also happened earlier today already, nevermind
[22:33:53] <Lucas_WMDE>	 (I guess that’s T403512)
[22:33:53] <stashbot>	 T403512: PHP Warning: Stats: (RateLimiter_limit_actions_total) Stats: Label value cannot be empty. - https://phabricator.wikimedia.org/T403512
[22:34:22] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:35:24] * Lucas_WMDE afk, if you need me ping me elsewhere
[22:44:41] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:55] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:50:22] <perryprog>	 lucaswerkmeister, happening again. Console errors: https://phabricator.wikimedia.org/P82432
[22:50:39] <perryprog>	 I'm not a fan of how the errors have to do with session IDs...
[22:53:34] <lucaswerkmeister>	 oh no
[22:55:08] * Lucas_WMDE looks at client errors logstash
[22:56:19] <Lucas_WMDE>	 nothing at all in logstash, so I guess this doesn’t get logged
[22:57:01] <perryprog>	 we have a client logstash? Good chance my content blockers could hit it, though if there isn't anything it's surely not widespread
[22:57:17] <perryprog>	 plus successful edit rate seems fine
[22:57:35] <lucaswerkmeister>	 but I think this has to be a bug somewhere in https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/82701978ceff97a0e24fcc73f0303afb4a1cfcb1/modules/ext.wikimediaEvents/editAttemptStep.js
[22:57:43] <lucaswerkmeister>	 (only codesearch result for session.editing_session_id)
[22:57:54] <lucaswerkmeister>	 totally different kind of session AFAICT
[22:58:20] <lucaswerkmeister>	 perryprog: yeah, it’s been around for a few years I think; but it’s possible it gets blocked (or respects do-not-track or something), I don’t know
[22:58:41] <perryprog>	 neat!
[22:58:59] <lucaswerkmeister>	 https://wikitech.wikimedia.org/wiki/Client_errors
[22:59:21] <wikibugs>	 (PS2) Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - https://gerrit.wikimedia.org/r/1180689
[23:00:06] <wikibugs>	 (Abandoned) Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - https://gerrit.wikimedia.org/r/1180689 (owner: Jdlrobson)
[23:00:42] <lucaswerkmeister>	 perryprog: hang on, on which wiki are you seeing these errors
[23:00:46] <perryprog>	 enwiki
[23:01:00] <perryprog>	 oh! That was group0!
[23:01:02] <lucaswerkmeister>	 then I’m fairly confident it’s not due to those config changes
[23:01:07] <lucaswerkmeister>	 they were group1 and then group0 yeah
[23:01:23] <perryprog>	 okay word—I think I got group0 and group2 backwards
[23:01:27] <wikibugs>	 (CR) Cwhite: [C:+2] Send email alerts to Reading Web "Performance Alert" Slack channel [puppet] - https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: Jdlrobson)
[23:01:28] <lucaswerkmeister>	 it’s *possible* that the session handler has some spooky action at a distance, but I think it’s more likely that these errors are unrelated
[23:01:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:36] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:09:06] <wikibugs>	 ops-eqiad, DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403561 (phaultfinder) NEW
[23:09:07] <wikibugs>	 ops-eqiad, DC-Ops: Power Supply - Status - issue on an-worker1141:9290 - https://phabricator.wikimedia.org/T403562 (phaultfinder) NEW
[23:10:49] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:10:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:19:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:21:08] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T402925)', diff saved to https://phabricator.wikimedia.org/P82433 and previous config saved to /var/cache/conftool/dbconfig/20250902-232107-ladsgroup.json
[23:21:12] <stashbot>	 T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[23:21:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:28:24] <wikibugs>	 (CR) Jdlrobson: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1184113 (https://phabricator.wikimedia.org/T392298) (owner: Jdlrobson)
[23:28:56] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:29:30] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 8.505 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:29:54] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:31:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:36:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82434 and previous config saved to /var/cache/conftool/dbconfig/20250902-233615-ladsgroup.json
[23:38:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184178
[23:38:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184178 (owner: TrainBranchBot)
[23:51:22] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P82435 and previous config saved to /var/cache/conftool/dbconfig/20250902-235121-ladsgroup.json
[23:52:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1184178 (owner: TrainBranchBot)
[23:53:56] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:55:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources